One Python 2.1 idea

Tue Dec 26 21:56:07 EST 2000

[Neelakantan Krishnaswami <neelk at alum.mit.edu
 news:slrn94ie9k.anf.neelk at alum.mit.edu]
> ...
> Fortunately, I don't have to put my money where my mouth is, since
> someone has already done it for me. :) Vladimir Marangozov has
> modified the Python interpreter (as of 1.5) to use the GNU computed
> labels extension and implemented direct threading.
>
>   http://starship.python.net/crew/vlad/archive/threaded_code/
>
> Bringing that up to date with 2.0 would be an interesting experiment.

The extreme simplicity of the patch suggests it's not doing much -- and that
suspicion is intensified due to Vladimir's utter lack of quantitative claims
for it <wink>.

The page Vlad references:

    http://www.complang.tuwien.ac.at/forth/threaded-code.html

identifies Python's current mechanism as "Switch Threading", and notes:

    The result has the same advantages as token threaded code; the
    disadvantage over token threaded code is the lower speed due to
    the range check generated by most (all?) C compilers for the switch

Other than that, the mechanisms appear identical by inspection:  both use
the opcode as an index into a table of addresses (in the current "switch"
case, that table is constructed by the compiler; in Vlad's case, by hand),
then jump to the address so loaded.  Don't know how lame gcc is <wink>, but
under MSVC for the Pentium the range check (to jump to the "default:" case
if the index would exceed the table size) amounts to a register move with
decrement, compare to immediate constant, and a non-taken jump.  If that is
indeed all the savings there is to be gotten via this technique, it's
marginal at best.

Note that at the 1998 Python Conference, John Aycock reported on an extreme
technique that can eliminate *all* eval-loop overhead, including instruction
decoding and dispatch costs; look for "Converting Python Virtual Machine
Code to C" at

    http://www.foretec.com/python/workshops/1998-11/proceedings.html

The results were inconclusive (detecting a pattern here <wink>?), but it
wasn't encouraging that pystone only got a 10% speed boost.  That did and
does seem anomalously small, but so long as "+" has to do pages of analysis
at runtime, the time to *find* the BINARY_ADD code really isn't important.
Threading techniques grew up in Forth, where the implementation of a typical
opcode consists of just a handful of machine instructions.

god-forbid-anyone-profile-the-code-and-find-out-where-it's-really-
    spending-time<wink>-ly y'rs  - tim