[Python-Dev] Optionally using GMP to implement long if available
Mark Dickinson
dickinsm at gmail.com
Mon Nov 10 18:57:05 CET 2008
Here's the test code I was using, modeled on the basic operation
that longobject needs: multiply two digits together and extract
high and low digits from the result.
#include <stdint.h>

typedef uint32_t digit;
typedef uint64_t twodigits;

#define SHIFT 30
#define MASK (((digit)1 << SHIFT) - 1)

/* multiply a and b, and split into high digit (returned)
   and low digit (put into *low) */
extern digit
digit_mul(digit *low, digit a, digit b) {
    twodigits prod;
    prod = (twodigits)a * b;
    *low = (digit)(prod & MASK);
    return (digit)(prod >> SHIFT);
}
Using gcc 4.0.1 on 32-bit x86 and compiling with "gcc -O1 -S test.c"
gives, for me, a file test.s that looks like:
.text
.globl _digit_mul
_digit_mul:
pushl %ebp
movl %esp, %ebp
pushl %esi
subl $4, %esp
movl 16(%ebp), %eax
mull 12(%ebp)
movl %eax, %esi
andl $1073741823, %esi
movl 8(%ebp), %ecx
movl %esi, (%ecx)
shrdl $30, %edx, %eax
addl $4, %esp
popl %esi
leave
ret
.subsections_via_symbols
There's only a single mull instruction there, showing that gcc
is doing the right thing, at least when optimization is turned on.
Without optimization, gcc produces three separate
multiplications, two of which are multiplications
by zero.
But if I compile with the -m64 flag to force a 64-bit build, the
multiply becomes:
imulq %rsi, %rdx
which looks a lot like a 64 x 64 -> 64 multiply to me. This
seems inefficient, when a 32 x 32 -> 64 bit multiply ought
to be good enough. But maybe there isn't a significant
performance difference on x86_64?
Mark