How fast can we multiply?

Christian Tismer tismer at appliedbiometrics.com
Wed Jul 21 07:43:33 EDT 1999


Tim Peters wrote:
> 
> [Tim]
> >> Andrew Kuchling's wrapping gmp for Python is the most useful thing
> >> that's ever been done in this area.
> 
> [Christian Tismer]

[wants to rebuild the world using nuts & bolts]

> I play with & investigate bigint algorithms in Python, & I enjoy it.  But
> serious work in this area requires serious speed, and GMP doesn't compromise
> on that -- from slaved-over assembler to low-level mutable packed storage
> limbs, it's doing the right stuff for its task.  Factors of 5 count here
> <wink>.

Agreed. Still, if we fall back on low-level functions which deal
with an optimized number of bits at a time (some reasonable cutoff),
then the Python overhead of operating intelligently with them
is (hopefully) small enough to be justified.

> You do "long1 & long2" in Python, and it's handled by an all-purpose bitop
> function that redispatches on the operation in each iteration of the inner
> loop.  It's "like that" everywhere:  the bigint code was written for
> compactness (longobject.c is only twice the size of intobject.c!), and
> no-hassle porting to the feeblest machine on earth.  The compromises have
> consequences (some good!  that code has been rock solid since the first
> release, while it can take a week to track down all the bugfix patches to
> the last official GMP release).

Yes. I think this simplicity is a real advantage.

The Python generalism of doing the general &-code with
type checks and everything in every inner loop is
expensive for small objects only.
If I can write algorithms in such a way that they operate
only above some "cutoff" level (say 5000-bit ints),
this is again not expensive.
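As a sketch of the idea (the cutoff value and the splitting point are
guesses for illustration, not measurements), here is a Karatsuba-style
multiply that hands everything below the cutoff to the builtin:

```python
def karatsuba(x, y, cutoff_bits=5000):
    """Multiply nonnegative ints, recursing only above the cutoff."""
    # Below the cutoff, Python overhead dominates: use the builtin.
    if x.bit_length() <= cutoff_bits or y.bit_length() <= cutoff_bits:
        return x * y
    n = max(x.bit_length(), y.bit_length()) // 2
    xh, xl = x >> n, x & ((1 << n) - 1)
    yh, yl = y >> n, y & ((1 << n) - 1)
    a = karatsuba(xh, yh, cutoff_bits)            # high * high
    b = karatsuba(xl, yl, cutoff_bits)            # low * low
    c = karatsuba(xh + xl, yh + yl, cutoff_bits)  # (xh+xl) * (yh+yl)
    # Three recursive multiplies instead of four.
    return (a << (2 * n)) + ((c - a - b) << n) + b
```

With the cutoff that high, only a handful of Python-level calls happen
per multiply, which is the whole point.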

...
[matmult]
> OTOH, Winograd's refinement of the matmult procedure requires 15 floating
> adds, where straight 2x2 matmult requires only 4; float + isn't much cheaper
> than float * on many modern processors.  So the win here isn't nearly as
> pure as for intmult, and the reduction is only from O(n**3) to O(n**2.81)
> operations anyway.  There are reasons few vendors offer this form of matmult
> in their libraries <0.5 wink>.

15 adds, sure. That would be expensive for a 2x2 matrix of reals.
But since we recursively apply this to 2x2 matrices of submatrices,
the cost of the multiplications quickly dwarfs the processor's
add-vs-multiply optimization again, and we win quickly.
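For concreteness, here is a rough sketch of that recursion in plain
Python (lists of lists, n a power of two; the cutoff value is invented
for illustration):

```python
def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def naive_mult(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def strassen(A, B, cutoff=32):
    n = len(A)
    if n <= cutoff:                      # scalars are submatrices below here
        return naive_mult(A, B)
    h = n // 2
    A11 = [row[:h] for row in A[:h]]; A12 = [row[h:] for row in A[:h]]
    A21 = [row[:h] for row in A[h:]]; A22 = [row[h:] for row in A[h:]]
    B11 = [row[:h] for row in B[:h]]; B12 = [row[h:] for row in B[:h]]
    B21 = [row[:h] for row in B[h:]]; B22 = [row[h:] for row in B[h:]]
    # Strassen's seven block products (instead of eight).
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), cutoff)
    M2 = strassen(mat_add(A21, A22), B11, cutoff)
    M3 = strassen(A11, mat_sub(B12, B22), cutoff)
    M4 = strassen(A22, mat_sub(B21, B11), cutoff)
    M5 = strassen(mat_add(A11, A12), B22, cutoff)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12), cutoff)
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22), cutoff)
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_sub(mat_add(M1, M3), M2), M6)
    top = [r1 + r2 for r1, r2 in zip(C11, C12)]
    bottom = [r1 + r2 for r1, r2 in zip(C21, C22)]
    return top + bottom
```

At the cutoff the recursion falls back to the ordinary O(n**3)
product, which is exactly the "scalars are submatrices" picture.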

Thinking of Jeffrey Chang's 6000x6000 matrices, what can we expect?
I didn't try it, but let's assume a cutoff at about 94x94 matrices,
which seems fairly high to me.
Jeffrey can then think of the multiplication of 64x64 matrices,
where the scalars are the 94x94 matrices from above.
The cost of adding two 94x94 matrices vanishes compared to
multiplication.
So Jeffrey might expect to go from 64 ** 3 == 262144
down to 64 ** ( log(7)/log(2) ) == 117649.0, which gives him
a speedup factor of 2.23 .

If we are luckier, and the cutoff lets us use 47x47 matrices,
the speedup will be 2.55 . And for 24x24 matrices, we end up at
a speedup of 2.91 .

Might we expect a win for 12x12 matrices already?
That would give us a speed gain of 3.33, but we would have to
guess the cutoff.
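The figures above can be reproduced in a few lines (assuming, as the
numbers suggest, a block count per side rounded up to a power of two:
64, 128, 256, 512 blocks):

```python
from math import log

def strassen_speedup(blocks_per_side):
    # Naive block matmult needs n**3 block products;
    # Strassen needs only n**log2(7) of them.
    return blocks_per_side ** 3 / blocks_per_side ** (log(7) / log(2))

# 6000x6000, padded and split into n blocks per side
for n in (64, 128, 256, 512):
    print(n, round(strassen_speedup(n), 2))
```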

Reading Knuth on this, the scaling necessary for stability
implies that this technique pays off for n >= 40, so if
we are lucky, the 47x47 case above could be implemented,
hopefully giving a factor of between 2 and 2.5 .


One concern I have, and one which stops me from digging seriously
into this, is numerical stability. Since we are dealing with floats
at the low level, the result may suffer from more rounding
errors, since more additions are involved.
At the least, this has to be studied carefully before replacing
one algorithm with another. I don't know what kind of
preconditioning is necessary here, but Knuth says some is.

Not so with fastlong: integer arithmetic is exact, so there it is
a perfect replacement.

I have not investigated further whether the Fourier transform
can be applied, i.e. whether matrix multiplication can be
expressed in terms of convolutions and Fourier transforms.
Multiplication of very long integers *can* be done this way.
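For the long-integer case, at least, the sketch is short. A toy
version (base-10 digits, a textbook recursive FFT, nothing tuned;
real implementations use much larger digit bases and exact number-
theoretic transforms to avoid rounding worries):

```python
import cmath

def fft(a, invert=False):
    # Recursive Cooley-Tukey FFT; len(a) must be a power of two.
    n = len(a)
    if n == 1:
        return a
    even = fft(a[0::2], invert)
    odd = fft(a[1::2], invert)
    sign = 1 if invert else -1
    result = [0j] * n
    for k in range(n // 2):
        w = cmath.exp(sign * 2j * cmath.pi * k / n)
        result[k] = even[k] + w * odd[k]
        result[k + n // 2] = even[k] - w * odd[k]
    return result

def fft_multiply(x, y):
    """Multiply nonnegative ints via convolution of their digits."""
    xs = [int(d) for d in str(x)[::-1]]   # least significant digit first
    ys = [int(d) for d in str(y)[::-1]]
    n = 1
    while n < len(xs) + len(ys):
        n *= 2
    fa = fft([complex(d) for d in xs + [0] * (n - len(xs))])
    fb = fft([complex(d) for d in ys + [0] * (n - len(ys))])
    # Pointwise product in the frequency domain == convolution of digits.
    fc = [a * b for a, b in zip(fa, fb)]
    conv = fft(fc, invert=True)
    digits = [round(v.real / n) for v in conv]   # undo FFT scaling
    # Carry propagation, done lazily via Python's own bigints.
    return sum(d * 10 ** i for i, d in enumerate(digits))
```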

[and about Jurjen's exact math, we'll wait for volunteers]

ciao - chris

-- 
Christian Tismer             :^)   <mailto:tismer at appliedbiometrics.com>
Applied Biometrics GmbH      :     Have a break! Take a ride on Python's
Kaiserin-Augusta-Allee 101   :    *Starship* http://starship.python.net
10553 Berlin                 :     PGP key -> http://wwwkeys.pgp.net
PGP Fingerprint       E182 71C7 1A9D 66E9 9D15  D3CC D4D7 93E2 1FAE F6DF
     we're tired of banana software - shipped green, ripens at home



