3-arg float pow()

Markus Schaber markus at schabi.de
Mon Sep 3 13:34:55 EDT 2001


Hi,

David C. Ullrich <ullrich at math.okstate.edu> wrote:

> Seems like pow(3., 500., 7.) can't possibly do anything except just
> 3.**500. % 7(???), hence the random nature of the result and the utter
> uselessness. 

The trick is that it is an exponential function with a large result 
taken modulo a small number. (Such operations are used, e.g., in the 
RSA crypto algorithm.)

When calculating the exponential function in software (as I know of no 
decent hardware support for long integers), you have more or less a 
loop of multiplications (depending on how optimized your algorithm 
is).

One optimized algorithm splits the 500 into powers of two, so 3**500 
becomes 3**(256 + 128 + 64 + 32 + 16 + 4) == (3**256) * (3**128) 
* (3**64) * (3**32) * (3**16) * (3**4). As you see, you can generate 
the bracketed values by repeatedly multiplying the number by itself 
(which is rather fast) and then multiplying those values together. In 
a computer this is easy to optimize, as the 500 is already given in 
binary. Thus one can iterate like the following:

def simple(base, exp, modulo):
  res = 1L
  while exp != 0L:
      if exp & 1L:          # this bit of the exponent is set
          res *= base
      exp >>= 1L            # move on to the next bit
      base *= base          # base, base**2, base**4, ...
  return res % modulo       # reduce only once, at the very end

This gives 9 rounds with two comparisons, one shift and at most two 
multiplications per round, thus a maximum of 45 operations compared to 
500 when using a simple plain loop.
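The 9 comes from the binary length of the exponent; a quick sketch to 
check it (nothing here beyond the numbers already in the text):

```python
# 500 in binary is 111110100 -- nine bits, so the square-and-multiply
# loop above runs nine times instead of doing 500 multiplications.
exp, rounds = 500, 0
while exp != 0:
    exp >>= 1
    rounds += 1
print(rounds)  # 9
```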

The problem is that the intermediate number gets rather large (which is 
the case here; we get 3**500 == 
36360291795869936842385267079543319118023385026001623040346035832580600191583895484198508262979388783308179702534403855752855931517013066142992430916562025780021771247847643450125342836565813209972590371590152578728008385990139795377610001). 
Using dynamically sized numbers (as our long integers are), this can 
get rather big (and thus memory- and cpu-intensive) compared to the 
final result 2.
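To get a feel for the blow-up, one can measure the size of that 
intermediate value (a sketch using the int bit_length method, which is 
not in the interpreter discussed above but makes the point):

```python
# The intermediate product 3**500 occupies 793 bits (about 239 decimal
# digits), while the final answer modulo 7 is the single digit 2.
n = 3 ** 500
print(n.bit_length())  # 793
print(n % 7)           # 2
```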

But as we calculate modulo 7, we can apply the modulo after every 
single multiplication step, and so keep the numbers small (which is 
much faster for large operands, and uses much less memory):

def optimized(base, exp, modulo):
  res = 1L
  while exp != 0L:
      if exp & 1L:
          res *= base
          res %= modulo     # keep the partial result small
      exp >>= 1L
      base *= base
      base %= modulo        # keep the running power small
  return res

Especially when calculating lots of numbers (such as applying this to a 
list or matrix), this can save megabytes of intermediate storage.
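As a sanity check, a copy of the optimized routine without the long 
suffixes (renamed modpow here purely for illustration) agrees with 
computing the full power and reducing once at the end:

```python
def modpow(base, exp, modulo):
    # Same square-and-multiply loop as 'optimized' above, reducing
    # after every multiplication so intermediates stay below modulo**2.
    res = 1
    while exp != 0:
        if exp & 1:
            res = res * base % modulo
        exp >>= 1
        base = base * base % modulo
    return res

print(modpow(3, 500, 7))  # 2 -- the same as 3**500 % 7
```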

A test like 

import time

start = time.clock()
for i in xrange(1000):
  void = simple(3L, 5000L, 7L)
stend = time.clock()
print 'simple:', round(stend - start, 2)

gives 7.28 seconds for the simple and 0.24 seconds for the 
modulo-optimized version on my machine.

On the other hand this doesn't apply when using small integers (as they 
don't expand in size), and doesn't gain much when the exponentiation 
result and the modulus are about the same size. E.g. using 50 as exp in 
the example above gives 0.09 seconds for the simple and 0.13 seconds 
for the optimized version, due to the additional modulo operations. 

Additionally, the second algorithm can be rewritten to calculate the 
result using ordinary integers (simply remove all the "L"s in the def 
and at the call), and this way it only needs 0.05 seconds, whereas the 
first algorithm raises an OverflowError. A math library coder could use 
some heuristics to dynamically decide which algorithm to use.
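Such a heuristic might look like the following sketch; the dispatch 
function and its cut-over threshold are invented here for illustration, 
not tuned or taken from any real library:

```python
def heuristic_modpow(base, exp, modulo):
    # Hypothetical dispatch: exp * bits(base) roughly bounds the size
    # of the full power. If it would stay small anyway, skip the
    # per-step reductions; otherwise reduce at every step.
    if exp * max(base.bit_length(), 1) < 1024:  # invented threshold
        return base ** exp % modulo
    res = 1
    while exp != 0:
        if exp & 1:
            res = res * base % modulo
        exp >>= 1
        base = base * base % modulo
    return res
```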

I ran an additional test, calculating pow(3,5000000,7), and the 
unoptimized version (all longs) needs about 50 seconds for _one_ loop, 
whereas the optimized version needs 0.35 seconds for _1000_ loops using 
longs, and 0.13 seconds using integers. Btw, the builtin pow function 
needs 0.15 and 0.01 seconds.
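The builtin applies the same reduce-as-you-go trick internally, which 
is why even a multi-million exponent answers immediately:

```python
# Three-argument pow never builds the full power 3**5000000 (a number
# of roughly eight million bits); it reduces modulo 7 as it goes.
print(pow(3, 5000000, 7))  # 2
```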

But I don't know whether this trick is usable in floating point, as 
most floating-point math is done in hardware nowadays. I can imagine 
that some FPUs have machine instructions to gain additional speed and 
maybe accuracy this way.

Another point is that when processing arrays or matrices, a 
sophisticated programmer may use MMX, 3DNow!, SSE or AltiVec to gain 
speed by processing elements in parallel, but I'm afraid this would 
not be easy to incorporate in Python, because there numbers are stored 
as references to objects.

PS: I'm away for a week from tomorrow on, so please be patient with my 
answers to your comments / questions.

markus
-- 
"The strength of the Constitution lies entirely in the determination of 
each citizen to defend it. Only if every single citizen feels duty 
bound to do his share in this defense are the constitutional rights 
secure." -- Albert Einstein


