[Python-Dev] Optionally using GMP to implement long if available

Mark Dickinson dickinsm at gmail.com
Mon Nov 10 18:57:05 CET 2008


Here's the test code I was using, modeled on the basic operation
that longobject needs:  multiply two digits together and extract
high and low digits from the result.


#include <stdint.h>

typedef uint32_t digit;
typedef uint64_t twodigits;
#define SHIFT 30
#define MASK (((digit)1 << SHIFT) - 1)

/* multiply a and b, and split into high digit (returned)
   and low digit (put into *low) */

extern digit
digit_mul(digit *low, digit a, digit b) {
    twodigits prod;
    prod = (twodigits)a * b;
    *low = (digit)(prod & MASK);
    return (digit)(prod >> SHIFT);
}
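
For reference, here's a quick driver to sanity-check digit_mul (a
minimal sketch; the test values are arbitrary).  It multiplies two
maximal 30-bit digits and verifies that recombining the halves gives
back the full product:

#include <stdio.h>

int main(void) {
    digit a = MASK, b = MASK, low;
    digit high = digit_mul(&low, a, b);
    /* recombine: (high << SHIFT) | low should equal a * b */
    printf("high = %u, low = %u\n", (unsigned)high, (unsigned)low);
    printf("recombined = %llu, expected = %llu\n",
           (unsigned long long)(((twodigits)high << SHIFT) | low),
           (unsigned long long)((twodigits)a * b));
    return 0;
}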

Using gcc 4.0.1 on 32-bit x86 and compiling with "gcc -O1 -S test.c"
gives me a test.s that looks like this:

	.text
.globl _digit_mul
_digit_mul:
	pushl	%ebp
	movl	%esp, %ebp
	pushl	%esi
	subl	$4, %esp
	movl	16(%ebp), %eax
	mull	12(%ebp)
	movl	%eax, %esi
	andl	$1073741823, %esi
	movl	8(%ebp), %ecx
	movl	%esi, (%ecx)
	shrdl	$30, %edx, %eax
	addl	$4, %esp
	popl	%esi
	leave
	ret
	.subsections_via_symbols

There's only a single mull instruction there, showing that gcc
is doing the right thing, at least when optimization is turned on.
Without optimization, gcc produces three separate
multiplications, two of which are multiplications
by zero.
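
That matches the generic schoolbook lowering of a 64-bit multiply
into 32-bit halves: only three partial products contribute to a
64 x 64 -> 64 result, and because a and b were zero-extended before
the multiply, the two cross products each have a zero operand.
Roughly, in C (a sketch of what the unoptimized code computes, not
actual gcc output):

/* 64 x 64 -> 64 via 32-bit halves; a_hi * b_hi shifts out entirely */
uint64_t mul64_schoolbook(uint64_t a, uint64_t b) {
    uint32_t a_lo = (uint32_t)a, a_hi = (uint32_t)(a >> 32);
    uint32_t b_lo = (uint32_t)b, b_hi = (uint32_t)(b >> 32);
    uint64_t lo    = (uint64_t)a_lo * b_lo;  /* the one multiply that matters */
    uint64_t cross = (uint64_t)a_lo * b_hi   /* zero when b fits in 32 bits */
                   + (uint64_t)a_hi * b_lo;  /* zero when a fits in 32 bits */
    return lo + (cross << 32);
}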

But if I compile with the -m64 flag to force 64-bit code, the
multiply becomes:

        imulq   %rsi, %rdx

which looks a lot like a 64 x 64 -> 64 multiply to me.  This
seems inefficient, since a 32 x 32 -> 64 multiply ought
to be good enough.  But maybe there isn't a significant
performance difference on x86_64?
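
A crude way to check would be a timing loop like the sketch below
(hypothetical, and the iteration count is arbitrary); chaining each
result into the next call keeps the multiplies from being optimized
away:

#include <stdio.h>
#include <time.h>

int main(void) {
    digit low = 1, high = 12345;
    long i;
    clock_t start = clock();
    for (i = 0; i < 100000000L; i++)
        /* feed each result back in so the chain can't be hoisted */
        high = digit_mul(&low, high | 1, low | 1);
    printf("%u %u: %.2f s\n", (unsigned)high, (unsigned)low,
           (double)(clock() - start) / CLOCKS_PER_SEC);
    return 0;
}

Compiling the same file with -m32 and -m64 and comparing the two runs
would at least bound the effect, though it also measures
calling-convention differences rather than the multiply alone.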

Mark

