[pypy-dev] MalGen as a benchmark?

Sat Sep 29 01:36:45 CEST 2012

Found a red-hot, branchy-looking Python kernel in the wild and
naturally I thought of you trace compiler folks! ;-) Hope that it
might be useful: I think it could make a nice addition to the speed
center, since it's a CPU-bound workload on all the machines I have
access to. (I haven't profiled it at all, though, so it could be
leaning heavily on paths in some unoptimized builtins.)

    MalGen is a set of scripts which generate large, distributed data
    sets suitable for testing and benchmarking software designed to
    perform parallel processing on large data sets. The data sets can be
    thought of as site-entity log files. After an initial seeding, the
    scripts allow for the data generation to be initiated from a single
    central node to run the generation concurrently on multiple remote
    nodes of the cluster.

    -- http://code.google.com/p/malgen/

Specifically, http://code.google.com/p/malgen/source/browse/trunk/bin/cloud/malgen/malgen.py
which gets run thusly:

::

    pypy malgen.py -O /tmp/ -o INITIAL.txt 0 50000000 10000000 21

(Where 5e7 is the "initial block size" and 1e7 is the
other-than-initial block size.) This generates the initial seeding they
were talking about. It is followed by a run for each of N blocks on
each node; in this hypothetical setup, with 5 blocks on each of four
nodes, the following is run for each block:

::

    pypy malgen.py -O /tmp [start_value]

The metadata is read out of the INITIAL.txt file and used to determine
the size of the block, and the parameter [start_value] is used to bump
to the appropriate start id count for the current block.
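
To make the shape of a full run concrete, here is a rough sketch that
just builds the command lines for the hypothetical 4-node, 5-block
setup above. It does not run anything. The helper names are made up
for illustration, and the way the start values are computed (ids
handed out contiguously after the initial block) is purely my
assumption; I haven't checked it against what malgen.py really expects.

```python
INITIAL_BLOCK_SIZE = 50000000   # 5e7, from the seeding command in the post
BLOCK_SIZE = 10000000           # 1e7, from the seeding command in the post


def seed_command(interpreter="pypy"):
    """Argument list for the initial seeding run (hypothetical helper)."""
    # Copied verbatim from the post. The meaning of the leading 0 and the
    # trailing 21 is not explained there, so they are passed through as-is.
    return [interpreter, "malgen.py", "-O", "/tmp/", "-o", "INITIAL.txt",
            "0", str(INITIAL_BLOCK_SIZE), str(BLOCK_SIZE), "21"]


def assumed_start_values(nodes=4, blocks_per_node=5):
    """Start values for every later block (hypothetical helper)."""
    # ASSUMPTION: ids are contiguous, so block k starts right after the
    # initial block plus k earlier blocks. malgen.py may do it differently.
    total_blocks = nodes * blocks_per_node
    return [INITIAL_BLOCK_SIZE + k * BLOCK_SIZE for k in range(total_blocks)]


def block_commands(interpreter="pypy"):
    """One argument list per later block, using the assumed start values."""
    return [[interpreter, "malgen.py", "-O", "/tmp", str(start)]
            for start in assumed_start_values()]


if __name__ == "__main__":
    print(" ".join(seed_command()))
    for argv in block_commands():
        print(" ".join(argv))
```

Swapping the ``interpreter`` argument between ``pypy`` and ``python``
would give the two sets of commands to time against each other.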

Inner loop: http://code.google.com/p/malgen/source/browse/trunk/bin/cloud/malgen/malgen.py#90

Thoughts?

- Leary