Opcode Tracker for the Python Byte-Code Interpreter
============================================================

by Perry A. Stoll

Created for Python 1.3, January 1996
Updated for Python 1.4, January 1997


Motivation
---------------

There has always been interest (on the Python newsgroup, at least) in
improving the performance of Python. Two possible strategies are
compiling the bytecode to machine-specific binaries or improving the
performance of the interpreter itself. Every so often someone comes up
with a clever idea which they think should make a difference in the
performance of the interpreter. Often, people step up and put their
code where their mouth is, implementing the suggested change.
Invariably, the call is sounded for a way to measure the effect of
such changes. That's where this opcode tracker fits in: it helps
identify where the Python interpreter is spending its time.

What
---------------

This distribution contains the tools needed to profile the bytecodes
run by the interpreter and to generate statistics about them. The
files included in this distribution are:

* README - this file

* INSTALL - instructions to install this package for Unix

* NEWS - the original comp.lang.python article which inspired this work

* ceval.optrack.patch - a small context diff used to patch
  Python/ceval.c, adding the ability to keep statistics (total count,
  time spent executing, execution chain distributions). To enable
  opcode tracking, you must then compile ceval.c with ENABLE_OPTRACKER
  defined.

* optracker.h - definitions and data structures for opcode tracker
  objects

* optrackermodule.c - a C module defining the optracker object and a
  few useful functions

* opstats.py - modeled after the pstats module in the standard
  distribution; provides profiling statistics for opcode objects

* optrackerscript.py - an example of using this package to profile
  opcodes. If you have NumPy, it performs some fancier analysis.
* pystone.py - a benchmark script from the Python-1.4 distribution,
  included here for demonstration purposes. This is actually a very
  boring program to profile, mainly function calls. Oh, well.

Note that building an optracker-capable interpreter must be done at
compile time.

Features:
----------------------------------------

Optracker provides a way to collect execution statistics from the
Python interpreter by keeping track of the frequency, timing, and
history of opcode execution:

1) Counts each opcode executed by the Python interpreter.
2) Times opcode execution (until the next opcode is executed).
3) Tracks opcode pair execution, i.e. digraph analysis. (Easily
   interfaces with the Numeric package for more interesting analysis!)
4) Requires minimal changes to Python/ceval.c.
5) Can be completely removed with a compile-time option.
6) Stats are reported through the existing Stats class in the
   pstats.py file in the distribution.

What is Actually Being Profiled?
------------------------------------------------------------

There is a loop in the Python bytecode interpreter (the for loop
around the massive switch statement in ceval.c:eval_code2()). Each
time through this loop, the count for the appropriate opcode is
incremented by one. That is sufficient to keep raw usage counts for
opcodes.

Gathering timing information for the opcodes is complicated by the
fact that the interpreter (eval_code2) is often called recursively.
If there is no recursive call, then the time for the executed opcode
is simply the time through the large switch statement. If there is a
recursive call, then the time for the first opcode accumulates until
the next opcode begins. This means that all the recursive function
call overhead is charged to the previous opcode. Also, if an opcode
calls out to a C routine, i.e. an extension, all of that time is
charged to the opcode. Something like this showed up when profiling
NumPy: I was wondering why a BINARY_DIVIDE took about 6 seconds.
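As an aside, modern Python can approximate the per-opcode counting
described above without patching ceval.c at all. The following is a
rough sketch, not part of this distribution: it uses sys.settrace with
the f_trace_opcodes frame attribute (available in Python 3.7 and
later) to count each opcode as it executes, much like the counter the
ceval.c patch maintains. The function `work` is just an arbitrary
workload for illustration.

```python
import dis
import sys
from collections import Counter

opcode_counts = Counter()

def tracer(frame, event, arg):
    # Ask CPython to emit per-opcode trace events for this frame (3.7+).
    frame.f_trace_opcodes = True
    if event == "opcode":
        # f_lasti is the offset of the instruction about to execute.
        op = frame.f_code.co_code[frame.f_lasti]
        opcode_counts[dis.opname[op]] += 1
    return tracer

def work(n):
    # Arbitrary workload: a simple summing loop.
    total = 0
    for i in range(n):
        total += i
    return total

sys.settrace(tracer)
result = work(100)
sys.settrace(None)

# opcode_counts now maps opcode names to execution counts; the loop
# above should have executed FOR_ITER roughly n times.
```

This only gives raw counts; timing and digraph tracking would need
more state, which is exactly what the C patch adds inside the
interpreter loop where it is cheap to do.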
Then I realized it was a really large matrix operation.

This whole scheme may not be the best, but it doesn't seem too bad
either. There are other strategies worth exploring, and I hope people
will play around with them.

Sample Output:
------------------------------------------------------------

Here is the timing information I get when running optrackerscript, a
sample script demonstrating optracking. Note that where it says
"function calls", that is of course opcodes executed. Also, the
tottime (total time) and cumtime (cumulative time) columns will always
be the same. I didn't understand until I'd finished a first
implementation that timing recursive calls differently (which makes
sense for functions) doesn't make much sense for opcodes. Well, maybe
it does: because the interpreter can recur, there is some sense of
total time spent in one opcode. But the worth of tracking that seemed
pretty dubious at best.

         556223 function calls in 25.283 CPU seconds

   Ordered by: internal time
   List reduced from 55 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall opcode:opcode id:(NAME)
   158150    4.504 0.000028    4.504 0.000028 opcode:124(LOAD_FAST)
    18185    4.061 0.000223    4.061 0.000223 opcode:131(CALL_FUNCTION)
    34298    2.711 0.000079    2.711 0.000079 opcode:105(LOAD_ATTR)
    25954    2.216 0.000085    2.216 0.000085 opcode:23(BINARY_ADD)
    83446    1.930 0.000023    1.930 0.000023 opcode:127(SET_LINENO)
    67165    1.774 0.000026    1.774 0.000026 opcode:125(STORE_FAST)
    24558    1.232 0.000050    1.232 0.000050 opcode:25(BINARY_SUBSCR)
    39470    0.867 0.000022    0.867 0.000022 opcode:100(LOAD_CONST)
     8660    0.835 0.000096    0.835 0.000096 opcode:95(STORE_ATTR)
       29    0.695 0.023970    0.695 0.023970 opcode:72(PRINT_NEWLINE)
       54    0.648 0.012003    0.648 0.012003 opcode:107(IMPORT_NAME)
     7028    0.454 0.000065    0.454 0.000065 opcode:116(LOAD_GLOBAL)
     6150    0.446 0.000072    0.446 0.000072 opcode:60(STORE_SUBSCR)
    13664    0.387 0.000028    0.387 0.000028 opcode:1(POP_TOP)
     1955    0.386 0.000197    0.386 0.000197 opcode:13(UNARY_CONVERT)
     8273    0.384 0.000046    0.384 0.000046 opcode:92(UNPACK_TUPLE)
    13496    0.325 0.000024    0.325 0.000024 opcode:111(JUMP_IF_FALSE)
     6229    0.266 0.000043    0.266 0.000043 opcode:24(BINARY_SUBTRACT)
     9785    0.250 0.000026    0.250 0.000026 opcode:83(RETURN_VALUE)
     9642    0.209 0.000022    0.209 0.000022 opcode:110(JUMP_FORWARD)

Using NumPy for further analysis...
POP_TOP       is followed by JUMP_IF_FALSE  97.2%
DUP_TOP       is followed by RETURN_VALUE   90.1%
UNARY_NOT     is followed by LOAD_FAST      90%
UNARY_CONVERT is followed by LOAD_ATTR      90%
BINARY_DIVIDE is followed by LOAD_FAST      100%
SLICE         is followed by LOAD_FAST      98.1%
SLICE3        is followed by LOAD_FAST      76.5%
STORE_SUBSCR  is followed by LOAD_FAST      94.2%
DELETE_SUBSCR is followed by LOAD_FAST      100%
PRINT_ITEM    is followed by RETURN_VALUE   92.2%
PRINT_NEWLINE is followed by PRINT_ITEM     89.7%
BREAK_LOOP    is followed by SET_LINENO     100%
LOAD_LOCALS   is followed by STORE_NAME     100%
RETURN_VALUE  is followed by LOAD_CONST     76.5%
EXEC_STMT     is followed by LOAD_FAST      100%
END_FINALLY   is followed by POP_TOP        100%
BUILD_CLASS   is followed by RETURN_VALUE   100%
UNPACK_LIST   is followed by CALL_FUNCTION  100%
STORE_ATTR    is followed by LOAD_FAST      98.8%
LOAD_ATTR     is followed by LOAD_FAST      92.4%
IMPORT_NAME   is followed by SET_LINENO     100%
IMPORT_FROM   is followed by IMPORT_NAME    100%
JUMP_IF_TRUE  is followed by LOAD_FAST      97.1%
FOR_LOOP      is followed by SET_LINENO     100%
SETUP_LOOP    is followed by SET_LINENO     100%
SETUP_EXCEPT  is followed by SET_LINENO     100%
SETUP_FINALLY is followed by SET_LINENO     100%
MAKE_FUNCTION is followed by LOAD_CONST     100%
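The "is followed by" report above comes from the opcode-pair (digraph)
counts the tracker records. The following sketch shows one way such a
report can be derived from pair counts; the numbers in `pair_counts`
are made up for illustration and are not taken from the run above.

```python
from collections import Counter, defaultdict

# Hypothetical pair counts of the kind the tracker records:
# (opcode, next_opcode) -> number of times that transition occurred.
pair_counts = Counter({
    ("POP_TOP", "JUMP_IF_FALSE"): 97,
    ("POP_TOP", "LOAD_FAST"): 3,
    ("STORE_ATTR", "LOAD_FAST"): 99,
    ("STORE_ATTR", "LOAD_CONST"): 1,
})

# Group the counts by the first opcode of each pair.
successors = defaultdict(Counter)
for (first, second), n in pair_counts.items():
    successors[first][second] += n

# For each opcode, report its most common successor and that
# successor's share of all transitions out of the opcode.
report = {}
for first, counts in successors.items():
    second, n = counts.most_common(1)[0]
    report[first] = (second, 100.0 * n / sum(counts.values()))

for first, (second, pct) in sorted(report.items()):
    print("%-13s is followed by %-14s %.1f%%" % (first, second, pct))
```

With the Numeric package, the same computation amounts to normalizing
each row of an opcode-by-opcode transition-count matrix and taking the
per-row maximum, which is what optrackerscript does when NumPy is
available.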