Concerns about performance w/Python, Psyco on Pentiums

Wed Mar 5 09:50:27 EST 2003

(This is an exploratory inquiry to see if anyone has suggestions
for things I can experiment with, or thoughts on what I might
be doing wrong.  If necessary, I will later be able to post more
concrete information.)

We've spiked a simulator for the Motorola 68HC12 microcontroller.
The code is pure Python: other than an initial data load, there
is no I/O until the program completes, and there are no threads,
no GUI, and no extension modules.
(For those who know about these things, we're loading an .S19
file with an image of the HC12 code, and simulating the CPU core
to dispatch on individual opcodes.)

On a Pentium 266 MMX, the simulator executes roughly 15000
simulated HC12 clock cycles per second.  On a P3 866 MHz 
chip, it can do about 85000 cycles per second.  On a P4 2GHz
it can do about 115000 cycles per second.  The real CPU 
runs 8 million of these clock cycles per second, and we
were hoping for significantly better performance than we've
seen so far.

More interesting to me, however, is the poor relative performance
of the faster machines.  I can believe the P3 866 should be about
5.5+ times faster than the old P266, but the P4 is only 35%
faster than the P3!  (Note, the P3 is running Win98SE and the P4
is running Red Hat 7.3, both with their "vanilla" Python 2.2
installations; on Linux that means the RPM from python.org.)
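
For reference, the arithmetic behind those comparisons can be
sketched as follows.  The throughput figures are the ones quoted
above; the names (measured, ratio, slowdown) are only illustrative.

```python
# Measured simulator throughput, in simulated HC12 clock cycles per
# second, as quoted in the post.
measured = {
    "P266 MMX": 15000,
    "P3 866": 85000,
    "P4 2GHz": 115000,
}

# The real CPU runs 8 million of these cycles per second.
REAL_HC12_CYCLES_PER_SEC = 8_000_000


def ratio(faster, slower):
    """Return how many times faster one machine ran the simulator."""
    return measured[faster] / measured[slower]


p3_vs_p266 = ratio("P3 866", "P266 MMX")   # roughly 5.7x
p4_vs_p3 = ratio("P4 2GHz", "P3 866")      # roughly 1.35x, i.e. ~35% faster

# How many times slower than real hardware each machine simulates.
slowdown = {name: REAL_HC12_CYCLES_PER_SEC / rate
            for name, rate in measured.items()}

for name, factor in slowdown.items():
    print(f"{name}: about 1/{factor:.0f} of real-time")
```

On these numbers the P3 comes out at roughly 1/94 of real-time,
which is where the "1/100" figure later in this post comes from.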

I had high hopes for Psyco, so I installed it on the P266MMX
machine and took a first stab at binding the core functions
(basically the dispatch routine, plus all opcode functions)
but achieved only a 12% speedup, substantially below my hopes
and expectations based on others' reports using Psyco.
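
As a rough illustration of what "binding" means here, the sketch
below passes a list of functions to psyco.bind().  The simulator
functions shown (op_nop, dispatch) and the helper bind_core are
hypothetical stand-ins, not the real code, and the import is
guarded because Psyco only exists for older Python 2 releases.

```python
# Hypothetical stand-ins for the real dispatch routine and opcode
# functions; the actual simulator code is not shown in this post.
def op_nop(cpu):
    cpu["pc"] = (cpu["pc"] + 1) & 0xFFFF


def dispatch(cpu, opcodes, opcode):
    opcodes[opcode](cpu)


try:
    import psyco  # only installable on old Python 2 interpreters
except ImportError:
    psyco = None


def bind_core(functions):
    """Ask Psyco to compile each function; return how many were bound.

    Returns 0 when Psyco is not installed, so the simulator still
    runs (unaccelerated) on interpreters without it.
    """
    if psyco is None:
        return 0
    for func in functions:
        psyco.bind(func)
    return len(functions)


bound = bind_core([dispatch, op_nop])
```

An alternative worth trying is psyco.full(), which asks Psyco to
compile everything it can rather than a hand-picked list.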

The core code consists of a loop which grabs a byte, does
a dictionary lookup to find the opcode function to call, and
calls it, passing in a reference to the CPU object.  Other
than twiddling bits, doing a lot of "& 0xFFFF" operations, and
the odd addition or multiplication, not much is going on.  
As a first approximation, I'd guess most of the time is going
into function calls (I'll profile at some point, of course).
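
The shape of that loop is roughly as sketched below.  This is a
minimal illustration only: the CPU class, the three opcode
functions, and the opcode values are all made up for the example
and are not real HC12 encodings or the actual simulator code.

```python
class CPU:
    """Hypothetical minimal CPU state; the real one holds far more."""

    def __init__(self, memory):
        self.memory = memory   # bytearray holding the loaded image
        self.pc = 0            # 16-bit program counter
        self.a = 0             # 8-bit accumulator (illustrative)
        self.cycles = 0        # simulated clock cycles consumed
        self.halted = False


def op_nop(cpu):
    cpu.cycles += 1


def op_inca(cpu):
    cpu.a = (cpu.a + 1) & 0xFF   # keep the register to 8 bits
    cpu.cycles += 1


def op_halt(cpu):
    cpu.halted = True
    cpu.cycles += 1


# Made-up opcode values mapped to their handler functions.
OPCODES = {0x00: op_nop, 0x01: op_inca, 0xFF: op_halt}


def run(cpu):
    """Fetch a byte, look up its handler in the dict, and call it."""
    memory = cpu.memory
    lookup = OPCODES             # local alias avoids a global lookup
    while not cpu.halted:
        opcode = memory[cpu.pc]
        cpu.pc = (cpu.pc + 1) & 0xFFFF   # wrap at 16 bits
        lookup[opcode](cpu)              # one Python call per opcode
    return cpu.cycles


cpu = CPU(bytearray([0x01, 0x01, 0x00, 0xFF]))
total = run(cpu)
print(total, cpu.a)
```

Every simulated instruction costs at least one dictionary lookup
and one Python-level function call, which is why call overhead is
the first suspect.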

So my questions are these:

1. Any thoughts on why the Linux-based P4 2GHz machine is only
   marginally faster than a machine it ought to be twice as
   fast as?  Is it because we're running code (maybe both Linux
   and Python) that isn't optimized for Pentiums?  In that case,
   why is Win98SE so much faster?  Does it self-adjust for
   faster CPUs, installing optimized modules when it detects a
   non-386 chip?

2. Any thoughts on why Psyco provides such a small speedup?
   Is it likely I'm using it wrong?  Or is it ineffective on
   code where the bottleneck is Python function calls?  Should
   I consider Pyrex instead?

At the moment, it's actually "fast enough", so this isn't an
urgent concern.  On the other hand, our intention is to use this
simulator to allow true test-driven development of embedded 
system code (which, I believe, may well be a "first"), and 
as we grow the number of tests we will doubtless become interested
in better performance.  With the P3 machine we can run at 1/100 
the native CPU speed, but I'd like to see something an order of 
magnitude faster.  In fact, my initial estimate was that on the 
fast CPUs and with Psyco we could probably achieve parity (using a 
2GHz chip and Python to simulate a lowly 16MHz chip) but I'm losing 
hope on that one.

Any input is welcome.  By the way, it's my firm intention that
the simulator and (I hope) the test framework itself will be
released as an open-source project.  (Note also that the entire
simulator and framework are themselves being test-driven, so they
will come complete with a full suite of unit and acceptance tests.)

Thanks.

-Peter