Real-world Python code 700 times slower than C

Brent Burley brent.burley at disney.com
Fri Jan 4 19:47:25 EST 2002


I often use a "10x" rule of thumb for comparing Python to C, but I
recently hit one real-world case where Python is almost 700 times
slower than C!  We just rewrote the routine in C and moved on, but
this has interesting implications for Python optimization efforts.

python
------
def Ramp(result, size, start, end):
    step = (end-start)/(size-1)
    for i in xrange(size):
        result[i] = start + step*i

def main():
    array = [0]*10000
    for i in xrange(100):
        Ramp(array, 10000, 0.0, 1.0)

main()


c version
---------
void Ramp(double* result, int size, double start, double end)
{
    double step = (end-start)/(size-1);
    int i;
    for (i = 0; i < size; i++)
        *result++ = start + step*i;
}

int main(void)
{
    double array[10000];
    int i;
    for (i = 0; i < 100000; i++)
        Ramp(array, 10000, 0.0, 1.0);
    return 0;
}

We use a Ramp function similar to this to generate rgb swatches for a
color picker.  Many, possibly large, color swatches are updated on
every mouse event, and performance was unacceptable in the pure Python
version.  There are also 2d and circular swatches that would be even
worse if coded in Python.

The Python version runs in 7.7 seconds.  The C version runs in 11.3
seconds, but loops 1000 times as many iterations.  The ratio is
therefore 7.7*1000/11.3 = 681 (or 68100%).
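The arithmetic behind that ratio can be checked directly; this quick
sketch just normalizes the two measured timings above to per-iteration
cost before comparing:

```python
# Measured timings from the runs above.
python_time = 7.7       # seconds for 100 iterations
c_time = 11.3           # seconds for 100000 iterations

# Normalize to per-iteration cost, then compare.
python_per_iter = python_time / 100
c_per_iter = c_time / 100000

ratio = python_per_iter / c_per_iter
print(round(ratio))     # prints 681
```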

As expected, 99.9% of the time is spent in eval_code2, the main
interpreter loop.  Within the loop, the profile is:

--- General loop overhead ---
switch(opcode)    .66 sec
HAS_ARG           .48 sec
tstate->ticker    .42 sec
NEXTARG           .30 sec
NEXTOP            .15 sec

--- Individual opcodes ---
FOR_LOOP:        1.59 sec
BINARY_MULTIPLY: 1.02 sec
LOAD_FAST:        .99 sec
BINARY_ADD:       .96 sec
STORE_SUBSCR:     .78 sec
STORE_FAST:       .15 sec
SET_LINENO:       .09 sec
JUMP_ABSOLUTE:    .06 sec

For comparison, consider that the entire equivalent C program runs in
.01 sec (when you equalize the number of iterations).  That means that
just running the switch(opcode) statement takes 66 times as long as
all the C code.

All the proposals I've seen for Python optimization are aimed at
general speedups.  That's fine, but a 50% (or even 90%) speedup won't
help much when your code is 500 times slower than it needs to be.  I
don't think that even JIT native code compilation will help much in
this case because of Python's dynamic nature.

I like the approach that the Perl Inline module takes where you can
put C code directly inline with your Perl code and the Inline module
compiles and caches the C code automatically.  However the fact that
it's C (with all of its safety and portability problems) and the fact
that it relies on a C compiler to be properly installed and accessible
make this approach unappealing for general use.

What I really want is something spiritually equivalent to a portable
inline assembly language with Python-ish syntax that generates really
fast native code and seamlessly integrates with Python.  I can dream,
can't I?
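For what it's worth, the array-at-once style offered by numeric
extensions (Numeric then, NumPy today) gets close to this wish for the
Ramp case: the inner loop runs in compiled code rather than bytecode.
A sketch, with `ramp` being just the obvious vectorized translation of
the function above:

```python
import numpy as np

def ramp(size, start, end):
    # linspace produces the same values as the Python loop above
    # (start + step*i with step = (end-start)/(size-1)), but the
    # iteration happens in compiled code.
    return np.linspace(start, end, size)

swatch = ramp(10000, 0.0, 1.0)
```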

--

As an aside, there's another interesting bottleneck we hit in our
production code.  We're reading a lookup table from a text file (for
doing image display color correction) that consists of 64K lines with
3 integers on each line.  The python code looks something like:

rArray = []
gArray = []
bArray = []
for line in open(lutPath).xreadlines():
    entry = line.split()
    rArray.append(int(entry[0]))
    gArray.append(int(entry[1]))
    bArray.append(int(entry[2]))

There are all kinds of ways to optimize this a little bit, but there
doesn't seem to be a way to make it acceptably fast.
map(int, open(path).read().split()) gets you pretty close, but
deinterleaving is still slow.  The C version ended up being several
hundred times faster.
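The deinterleaving step can at least be pushed into extended slices,
which keeps the per-element work out of Python bytecode.  A sketch,
assuming the same three-integers-per-line format (the `read_lut` name
and the inline test string are mine, for illustration):

```python
def read_lut(text):
    # Parse every integer in one pass, then deinterleave with
    # extended slices instead of a per-line Python loop.
    flat = list(map(int, text.split()))
    return flat[0::3], flat[1::3], flat[2::3]

r, g, b = read_lut("1 2 3\n4 5 6\n")
# r == [1, 4], g == [2, 5], b == [3, 6]
```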

Brent Burley



More information about the Python-list mailing list