Python speed vs csharp

Thu Jul 31 23:11:35 EDT 2003

On Wed, 30 Jul 2003 23:09:22 -0700, Mike <mike at nospam.com> wrote:

>Bear with me: this post is moderately long, but I hope it is relatively
>succinct. 
>
>I've been using Python for several years as a behavioral modeling tool for
>the circuits I design. So far, it's been a good trade-off: compiled C++
>would run faster, but the development time of Python is so much faster, and
>the resulting code is so much more reliable after the first pass, that I've
>never been tempted to return to C++. Every time I think stupid thoughts
>like, "I'll bet I could do this in C++," I get out my copy of Scott Meyers'
>"Effecive C++," and I'm quickly reminded why it's better to stick with
>Python (Meyers is a very good author, but points out lots of quirks and
>pitfalls with C++ that I keep thinking that I shouldn't have to worry
>about, much less try to remember). Even though Python is wonderful in that
>regard, there are problems.
>
>Here's the chunk of code that I'm spending most of my time executing:
>
># Rational approximation for erfc(x) (Abramowitz & Stegun, Sec. 7.1.26)
># Fifth order approximation. |error| <= 1.5e-7 for all x
>#
>def erfc( x ):
>   p  =  0.3275911
>   a1 =  0.254829592
>   a2 = -0.284496736
>   a3 =  1.421413741
>   a4 = -1.453152027
>   a5 =  1.061405429
>
>   t = 1.0 / (1.0 + p*float(x))
>   erfcx = ( (a1 + (a2 + (a3 +
>             (a4 + a5*t)*t)*t)*t)*t ) * math.exp(-(x**2))
>   return erfcx
>
>This is an error function approximation, which gets called around 1.5
>billion times during the simulation, and takes around 3500 seconds (just
>under an hour) to complete. While trying to speed things up, I created a
>simple test case with the code above and a main function to call it 10
>million times. The code takes roughly 210 seconds to run. 

I recoded the above and also in-lined the code, and also made a C DLL module version
and timed them roughly:

====< erfcopt.py >=====================================================
import math
def erfc_old( x ):
    p  =  0.3275911
    a1 =  0.254829592
    a2 = -0.284496736
    a3 =  1.421413741
    a4 = -1.453152027
    a5 =  1.061405429

    t = 1.0 / (1.0 + p*float(x))
    erfcx = ( (a1 + (a2 + (a3 +
              (a4 + a5*t)*t)*t)*t)*t ) * math.exp(-(x**2))
    return erfcx

#   You know that all those "constant" assignments are
#   actually  executed  every  time  erfc  is  called,
#   right?  To  move  that  to  def-time  instead   of
#   call-time, you can write

def erfc( x, # constants bound at def time, rparen moved down, commas added ):
    p  =  0.3275911,
    a1 =  0.254829592,
    a2 = -0.284496736,
    a3 =  1.421413741,
    a4 = -1.453152027,
    a5 =  1.061405429,
    exp = math.exp
 ): # just to make it stand out
    t = 1.0 / (1.0 + p*x) #XXX# don't need float(), since p is float
    return  ( (a1 + (a2 + (a3 +
              (a4 + a5*t)*t)*t)*t)*t ) * exp(-x**2) #XXX# elim attr lookup on global math.exp
    # elim intermediate store of erfcx #XXX# return erfcx

def test():
    from time import clock
    from erfc import erfc as erfc_in_c

    # sanity check
    for x in range(20): assert erfc_old(x) == erfc(x) and abs(erfc_in_c(x)-erfc_old(x))<1e-18

    t1=clock()
    for i in xrange(100000): erfc_old(1.2345)
    t2=clock()
    for i in xrange(100000): erfc(1.2345)
    t3=clock()
    # inlining code from old version:
    # constant assignments hoisted out of loop, of course
    p  =  0.3275911
    a1 =  0.254829592
    a2 = -0.284496736
    a3 =  1.421413741
    a4 = -1.453152027
    a5 =  1.061405429
    x = 1.2345
    for i in xrange(100000):
        t = 1.0 / (1.0 + p*float(x))
        erfcx = ( (a1 + (a2 + (a3 +
                  (a4 + a5*t)*t)*t)*t)*t ) * math.exp(-(x**2))
    t4 = clock()     
    for i in xrange(100000): erfc_in_c(1.2345)
    t5=clock()
    print 'old: %f, new: %f, inline: %f, in_c %f' %(t2-t1, t3-t2, t4-t3, t5-t4)

    import dis
    print '---- erfc_old code -----------------------------------'
    dis.dis(erfc_old)
    print '---- erfc --------------------------------------------'
    dis.dis(erfc)

if __name__== '__main__':
    test()
=======================================================================

Result is:

[19:48] C:\pywk\cstuff>erfcopt.py
old: 4.490975, new: 3.123005, inline: 3.051674, in_c 0.806907

I was surprised that the inline gain was not more, but I guess there was enough computation
to obscure that cost. Anyway, moving some of the work to def-time paid off about 30%. But
going to C cut 82%.

I added the C version to the sanity check, and had to allow an epsilon. I suspect that this
is because the C version probably keeps the whole problem in the FPU with 64 fractional bits
of precision until returning the final double (which has 53 bits), wherease the Python
algorithm presumably stores all intermediate values in memory with the 53-bit precision,
and so loses in the overall. That's my theory, anyway.

To see the extra work the old code is doing vs the new, you can see below:

---- erfc_old code -----------------------------------
          0 SET_LINENO               2

          3 SET_LINENO               3
          6 LOAD_CONST               1 (0.32759110000000002)
          9 STORE_FAST               3 (p)

         12 SET_LINENO               4
         15 LOAD_CONST               2 (0.25482959199999999)
         18 STORE_FAST               2 (a1)

         21 SET_LINENO               5
         24 LOAD_CONST               3 (-0.28449673599999997)
         27 STORE_FAST               5 (a2)

         30 SET_LINENO               6
         33 LOAD_CONST               4 (1.4214137410000001)
         36 STORE_FAST               4 (a3)

         39 SET_LINENO               7
         42 LOAD_CONST               5 (-1.453152027)
         45 STORE_FAST               7 (a4)

         48 SET_LINENO               8
         51 LOAD_CONST               6 (1.0614054289999999)
         54 STORE_FAST               6 (a5)

Everything above here was unnecessary to do at run-time

         57 SET_LINENO              10
         60 LOAD_CONST               7 (1.0)
         63 LOAD_CONST               7 (1.0)
         66 LOAD_FAST                3 (p)
         69 LOAD_GLOBAL              6 (float)
         72 LOAD_FAST                0 (x)
         75 CALL_FUNCTION            1
         78 BINARY_MULTIPLY
         79 BINARY_ADD
         80 BINARY_DIVIDE
         81 STORE_FAST               8 (t)

         84 SET_LINENO              11
         87 LOAD_FAST                2 (a1)
         90 LOAD_FAST                5 (a2)
         93 LOAD_FAST                4 (a3)
         96 LOAD_FAST                7 (a4)
         99 LOAD_FAST                6 (a5)
        102 LOAD_FAST                8 (t)
        105 BINARY_MULTIPLY
        106 BINARY_ADD
        107 LOAD_FAST                8 (t)
        110 BINARY_MULTIPLY
        111 BINARY_ADD
        112 LOAD_FAST                8 (t)
        115 BINARY_MULTIPLY
        116 BINARY_ADD
        117 LOAD_FAST                8 (t)
        120 BINARY_MULTIPLY
        121 BINARY_ADD
        122 LOAD_FAST                8 (t)
        125 BINARY_MULTIPLY
        126 LOAD_GLOBAL              9 (math)    <<--+-- replaced by a single LOAD_FAST
        129 LOAD_ATTR               10 (exp)     <<--+
        132 LOAD_FAST                0 (x)
        135 LOAD_CONST               8 (2)
        138 BINARY_POWER
        139 UNARY_NEGATIVE
        140 CALL_FUNCTION            1
        143 BINARY_MULTIPLY
        144 STORE_FAST               1 (erfcx)   <<--+
                                                     | 
        147 SET_LINENO              13           <<--+
        150 LOAD_FAST                1 (erfcx)   <<--+-- all eliminated
        153 RETURN_VALUE
        154 LOAD_CONST               0 (None)
        157 RETURN_VALUE
---- erfc --------------------------------------------
          0 SET_LINENO              20

          3 SET_LINENO              29
          6 LOAD_CONST               1 (1.0)
          9 LOAD_CONST               1 (1.0)
         12 LOAD_FAST                1 (p)
         15 LOAD_FAST                0 (x)
         18 BINARY_MULTIPLY
         19 BINARY_ADD
         20 BINARY_DIVIDE
         21 STORE_FAST               8 (t)

         24 SET_LINENO              30
         27 LOAD_FAST                2 (a1)
         30 LOAD_FAST                3 (a2)
         33 LOAD_FAST                4 (a3)
         36 LOAD_FAST                5 (a4)
         39 LOAD_FAST                6 (a5)
         42 LOAD_FAST                8 (t)
         45 BINARY_MULTIPLY
         46 BINARY_ADD
         47 LOAD_FAST                8 (t)
         50 BINARY_MULTIPLY
         51 BINARY_ADD
         52 LOAD_FAST                8 (t)
         55 BINARY_MULTIPLY
         56 BINARY_ADD
         57 LOAD_FAST                8 (t)
         60 BINARY_MULTIPLY
         61 BINARY_ADD
         62 LOAD_FAST                8 (t)
         65 BINARY_MULTIPLY
         66 LOAD_FAST                7 (exp)    <<--
         69 LOAD_FAST                0 (x)
         72 LOAD_CONST               2 (2)
         75 BINARY_POWER
         76 UNARY_NEGATIVE
         77 CALL_FUNCTION            1
         80 BINARY_MULTIPLY
         81 RETURN_VALUE                        <<--
         82 LOAD_CONST               0 (None)
         85 RETURN_VALUE

[19:49] C:\pywk\cstuff>

Regards,
Bengt Richter