Optimizing C module

Christian Tismer tismer at tismer.com
Thu Feb 10 13:37:00 EST 2000


I think I see a customer for SLP coming :-)

Paul Prescod wrote:
> 
> The slowest thing in both PyExpat and Perl's equivalent is the "cross
> over" between C and Python. I can think of a couple of ways to speed
> that up but I want to know if they are feasible.

The same applies to the SGMLOP module.

> The basic pattern of the code I need to speed up is:
> 
> sometype Foo( somearg1, somearg2, ... ){
>         sometype real_rc;
>         PyObject *pyargs, *rc;
> 
>         pyargs = Py_BuildValue( someformat, somearg1, somearg2, ... );
> 
>         rc = PyEval_CallObject( callback, pyargs );
>         Py_XDECREF( pyargs );
> 
>         real_rc = SomeConversion( rc );
>         Py_XDECREF( rc );
>         return real_rc;
> }

Well, if you want to be fast with callbacks, Py_BuildValue
and PyArg_ParseTuple have to be replaced by faster direct stuff
anyway. See for instance the builtin len(). This buddy
spends 90% of its time in PyArg_ParseTuple. :-)

> The Py_BuildValue can't be so fast because it is parsing strings and
> creating new heap objects. I'm wondering if I could keep a fixed-length
> args tuple in my back pocket and just fill in the details for each
> callback. I would really love to hang on to the string and int objects
> in the array too. How about I could check the refcount after the
> function call and only generate a "new one" if the refcount is >1. If
> the refcount is 1, I would mutate the string and int objects under the
> covers.

Good approach, but maybe it is even easier...
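To make the idea concrete, here is a Python sketch of the "fixed args
holder in my back pocket" trick (all names invented; a list stands in
for the tuple and sys.getrefcount for a direct ob_refcnt check, since
a real extension would do this in C with PyTuple_SET_ITEM):

```python
import sys

class ArgCache:
    """Conceptual stand-in for a C-level cached argument tuple."""

    def __init__(self):
        self._args = [None]          # reusable one-slot argument holder

    def call(self, callback, value):
        # getrefcount reports 2 when the attribute and its own parameter
        # are the only references; more means the callback kept the old
        # holder, so we must not mutate it under the covers.
        if sys.getrefcount(self._args) > 2:
            self._args = [None]      # someone kept it; make a fresh one
        self._args[0] = value        # mutate instead of rebuilding
        return callback(self._args)
```

When no callback retains the holder, every call reuses the same object,
which is exactly the saving described above; the refcount test is the
"only generate a new one if the refcount is > 1" check.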

> The next question is whether I can go even farther. Is there something I
> can safely cache that is created in PyEval_CallObject and/or
> eval_code2() (e.g. frame objects) so that the whole thing is just a
> matter of setting a couple of values and jumping?
> 
> I'm sure stackless Python has something to do with all of this but I
> missed the talk at IPC8. It looks to me like eval_code2 would need to be
> broken up to allow me to set up and reuse my own frame object. That's
> probably part of what stackless does (the part I want!).

That's exactly what Stackless Python does!
The setup of a code object with its frame is one step.
Then this frame is handed to a dispatcher, and the dispatcher
runs an evaluator loop on the topmost frame in tstate. It does
so until the frame is stopped. Then it runs the next frame
and so on. This is like "dribbling".

Now, where I think things can be much easier and faster is this:
Instead of calling a function again and again, you call
it once and catch its continuation. The function is turned
into an endless loop, and you resume it by a coroutine transfer.
The transfer carries a single parameter, which is enough in many
cases. All the local variables are still alive, so there is no
need to prepare new complicated argument lists.
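A present-day generator gives a rough feel for this (a sketch only,
not the Stackless API): the callback is written as an endless loop,
set up once, and then resumed with a single value per transfer while
its locals stay alive.

```python
def handler():
    # Hypothetical callback rewritten as an endless loop. The local
    # 'count' survives every transfer, so no argument tuple is ever
    # rebuilt; each resume carries exactly one value in.
    count = 0
    value = yield                     # wait for the first transfer
    while True:
        count += 1
        value = yield (count, value)  # transfer a result out, next value in
```

Usage: call it once with `co = handler(); next(co)` to run up to the
first yield, then each `co.send(value)` is one cheap transfer.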

Going to the extremes, it would even be possible to modify
the locals of the frame directly. If this is desirable,
I can add an interface to this.

Turning the C code into a stackless version would be the
ultimate thing. I'm planning to do so for SGMLOP. This would mean
turning it into an evaluator function by itself: it would get
a primitive frame, and it would be dispatchable like any
other Python frame. This would solve the consumer/producer
problem, and you could run the whole thing as a couple of
coroutines.
But this isn't necessary if you just want it fast. By saving
a couple of continuations of the necessary Python functions,
you can dispatch between contexts at very high speed.
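The consumer/producer case can be sketched the same way (generators
again standing in for saved continuations; all names here are invented
for illustration):

```python
def producer(items, box):
    for item in items:
        box.append(item)
        yield                        # transfer control away
    box.append(None)                 # sentinel: nothing more to produce

def consumer(box, out):
    while True:
        while not box:
            yield                    # nothing yet; transfer back
        item = box.pop(0)
        if item is None:
            return                   # producer is done; stop this frame
        out.append(item)

def dispatch(coros):
    """Round-robin dispatcher: resume each saved 'continuation' in
    turn until all of them have stopped."""
    live = list(coros)
    while live:
        for co in list(live):
            try:
                next(co)
            except StopIteration:
                live.remove(co)
```

Each `next()` is one context transfer; neither side ever rebuilds an
argument list, and neither needs the other's C stack below it.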

If you want me to help here, let me know.

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Düppelstr. 31                :    *Starship* http://starship.python.net
12163 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home
