> Now that I can run benchmarks against Python 2.7 and 3.3 simultaneously,
> I'm ready to start updating the benchmarks. This involves two parts.
> One is moving benchmarks from PyPy over to the unladen repo on
> hg.python.org/benchmarks. But I wanted to first make sure people don't
> view the benchmarks as immutable (e.g. as Octane does:
> https://developers.google.com/octane/faq). Since the benchmarks are
> always relative between two interpreters their immutability isn't critical
> compared to if we were to report some overall score. But it also means that
> any changes made would throw off historical comparisons. For instance, if I
> take PyPy's Mako benchmark (which does a lot more work), should it be named
> mako_v2, or should we just replace mako wholesale?

I dislike benchmark immutability.  The rest of the world, including the
local computing environment where the benchmarks run, keeps changing around
them.  That makes it pointless to compare historical benchmark data from a
run on an old version against a recent run.

What is needed more is benchmark *rerunability* and *repeatability*: an
old version of a Python implementation can be built today and run the
current benchmark suite within the exact same environment as a current
version of a Python implementation.  The key is that both ran the same
thing on the same hardware in the same configuration at around the same
time.

Nothing else is a valid comparison, as too many untracked, unquantified
variables have changed in the interim.
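
A rough sketch of what I mean by running both at around the same time,
assuming each interpreter can be invoked as a plain command line.  The
helper names (time_command, interleaved_runs, relative_speed) are made up
for illustration and are not part of the existing benchmark runner:

```python
import subprocess
import time


def time_command(cmd):
    """Run cmd once and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)
    return time.perf_counter() - start


def interleaved_runs(commands, rounds=5, timer=time_command):
    """Time each labelled command `rounds` times, alternating between them.

    Alternating means every interpreter sees roughly the same machine
    state (load, caches, thermal throttling) rather than one running
    long before the other.
    """
    timings = {label: [] for label in commands}
    for _ in range(rounds):
        for label, cmd in commands.items():
            timings[label].append(timer(cmd))
    return timings


def relative_speed(timings, baseline, candidate):
    """Return min(baseline) / min(candidate); above 1.0, candidate is faster."""
    return min(timings[baseline]) / min(timings[candidate])


# Hypothetical usage; the interpreter paths and bm_mako.py are placeholders:
# timings = interleaved_runs({
#     "2.7": ["/opt/py27/bin/python", "bm_mako.py"],
#     "3.3": ["/opt/py33/bin/python3", "bm_mako.py"],
# })
# print(relative_speed(timings, "2.7", "3.3"))
```

Only the ratio from a single session like this is meaningful; numbers
saved from a session months ago are not comparable to it.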

Where the above clearly fails: creating historical trend graphs.  If you
want a setup that runs the benchmarks after every commit, or at least runs
them as continuously as possible, _that_ benchmark suite needs to be as
immutable as possible.  The machine on which they are run also needs to be
locked down to have no updates applied and nothing else running on it
*ever*.  Whenever either the benchmark suite or the OS, distro or hardware
of the trend-tracking machine is changed, it needs to be clearly noted so
that deltas in the results at that point can be flagged as a discontinuity
in the trend data caused by the external change.  One way to do this is to
always version benchmark names: any time one is updated, give it a new
versioned name so it can't be compared with past results.
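
To make the flagging idea concrete, here is a minimal sketch.  It assumes
each stored result records the benchmark's version and some identifier for
the machine/OS setup; the Result fields and both function names are
hypothetical, not an existing speed.python.org schema:

```python
from collections import namedtuple

# One stored data point in the trend; all field names are illustrative.
Result = namedtuple("Result", "commit benchmark version environment seconds")


def versioned_name(benchmark, version):
    """Build a name such as 'mako_v2' so versions never share a series."""
    return "%s_v%d" % (benchmark, version)


def flag_discontinuities(results):
    """Pair each result (given in commit order) with what changed before it.

    The second item of each pair lists "benchmark" and/or "environment"
    when that differs from the previous result of the same benchmark, so a
    jump in the graph at that point can be marked as externally caused.
    """
    last_seen = {}
    flagged = []
    for result in results:
        previous = last_seen.get(result.benchmark)
        reasons = []
        if previous is not None:
            if previous.version != result.version:
                reasons.append("benchmark")
            if previous.environment != result.environment:
                reasons.append("environment")
        flagged.append((result, reasons))
        last_seen[result.benchmark] = result
    return flagged
```

A trend graph would then either break the line wherever the reasons list
is non-empty or plot each versioned_name() as its own series.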

Otherwise for historical data, periodically rerunning the benchmark suite
on older versions (releases and betas) for use in modern comparisons is


> And the second is the same question for libraries. For instance, the
> unladen benchmarks have Django 1.1a0 as the version which is rather
> ancient. And with 1.5 coming out with provisional Python 3 support I
> obviously would like to update it. But the same questions as with
> benchmarks crop up in reference to immutability. Another thing is that
> 2to3 can't actually be ported using 2to3 (
> http://bugs.python.org/issue15834) and so that itself will require two
> versions -- a 2.x version (probably from Python 2.7's stdlib) and a 3.x
> version (from the 3.2 stdlib) -- which already starts to add interesting
> issues for me in terms of comparing performance (e.g. I will have to
> probably update the 2.7 code to use io.BytesIO instead of StringIO.StringIO
> to be on more equal footing). Similar thing goes for html5lib which has
> developed its Python 3 support separately from its Python 2 code.
> If we can't find a reasonable way to handle all of this then what I will
> do is branch the unladen benchmarks for 2.x/3.x benchmarking, and then
> create another branch of the benchmark suite to just be for Python 3.x so
> that we can start fresh with a new set of benchmarks that will never change
> themselves for benchmarking Python 3 itself. That would also mean we could
> start off with whatever is needed from PyPy and unladen to have the optimal
> benchmark runner for speed.python.org.
