Bug in floating point multiplication

Mon Jul 6 11:44:00 EDT 2015

On Sat, 4 Jul 2015 at 02:12 Jason Swails <jason.swails at gmail.com> wrote:

> On Fri, Jul 3, 2015 at 11:13 AM, Oscar Benjamin <
> oscar.j.benjamin at gmail.com> wrote:
>
>> On 2 July 2015 at 18:29, Jason Swails <jason.swails at gmail.com> wrote:
>>
>> Where is the 32 bit one looks like:
>>
>> $ objdump -d a.out.32 | less
>> ...
>
>  804843e:  fildl  -0x14(%ebp)
>>  8048441:  fmull  -0x10(%ebp)
>>  8048444:  fnstcw -0x1a(%ebp)
>>  8048447:  movzwl -0x1a(%ebp),%eax
>>  804844b:  mov    $0xc,%ah
>>  804844d:  mov    %ax,-0x1c(%ebp)
>>  8048451:  fldcw  -0x1c(%ebp)
>>  8048454:  fistpl -0x20(%ebp)
>>  8048457:  fldcw  -0x1a(%ebp)
>>  804845a:  mov    -0x20(%ebp),%eax
>>  804845d:  cmp    -0x14(%ebp),%eax
>>  8048460:  jne    8048477 <main+0x5c>
>>  8048462:  sub    $0x8,%esp
>>  8048465:  pushl  -0x14(%ebp)
>>  8048468:  push   $0x8048520
>>  804846d:  call   80482f0 <printf at plt>
>>  8048472:  add    $0x10,%esp
>>  8048475:  jmp    8048484 <main+0x69>
>>  8048477:  addl   $0x1,-0x14(%ebp)
>>  804847b:  cmpl   $0xf423f,-0x14(%ebp)
>>  8048482:  jle    804843e <main+0x23>
>> ...
>>
>> So the 64 bit one is using SSE instructions and the 32-bit one is
>> using x87. That could explain the difference you see at the C level
>> but I don't see it on this CPU (/proc/cpuinfo says Intel(R) Core(TM)
>> i5-3427U CPU @ 1.80GHz).
>>
>
> Hmm.  Well that could explain why you don't get the same results as me.
> My CPU is a
> AMD FX(tm)-6100 Six-Core Processor
>  (from /proc/cpuinfo).  My objdump looks the same as yours for the 64-bit
> version, but for 32-bit it looks like:
>

So if we have different generated machine instructions it suggests a
difference in the way it was compiled rather than in the hardware itself.
(Although it could be that the compilers were changed because the hardware
was inconsistent in this particular usage).

> However, I have no experience looking at raw assembler, so I can't discern
> what it is I'm even looking at (nor do I know what explicit SSE
> instructions look like in assembler).
>

The give away is that SSE instructions use the XMM registers so where you
see %xmm0 etc data is being loaded ready for SSE instructions. I'll
translate the important part of the 32 bit code below:

...
 804843a:       db 44 24 14             fildl  0x14(%esp)

FILDL - F is for float and indicates that this is an x87 instruction. LD is
for load and indicates that we want to load from memory to the x87 register
stack. The I is for integer: we're loading from int format. I can't
remember what the L suffix means.

This instruction will load the integer from the memory stack at offset 0x14
from the stack pointer (esp) pushing it onto the x87 stack. Since the x87
stack uses 80-bit extended-real precision format internally this
instruction performs a format conversion. The conversion should be lossless
as 80-bit FP uses a 64 bit mantissa so it should be able to represent any
64-bit signed or unsigned integer (we're only using 32 bits in this
example).

 804843e:       dc 4c 24 18             fmull  0x18(%esp)

FMULL (x87 multiply) - This multiplies the top of the x87 stack (the number
we previously stored) by the 64 bit float at 0x18 stack offset. In context
that's the number 1-.5**53 which was presumably calculated earlier. The top
of the x87 stack is overwritten with the result. This calculation is
performed using 80-bit floating point with a 64 bit mantissa.

 8048442:       dd 5c 24 08             fstpl  0x8(%esp)

FSTPL (x87 store and pop) - Pop the top of the x87 stack pushing to the
memory stack. This is storing the result of the multiplication back to RAM
(0x8 from stack) as a 64 bit float. Since it was internally stored as an
80-bit float there is a format conversion here that can lose precision.
Rounding can occur according to the settings in the x87 control word.

 8048446:       f2 0f 2c 44 24 08       cvttsd2si 0x8(%esp),%eax

This converts the 64 bit float stored at 0x8 from stack to a signed int,
truncating if inexact, and stores the result in register eax. This
operation is guaranteed to truncate (rather than floor etc) if it needs to
round.

 804844c:       3b 44 24 14             cmp    0x14(%esp),%eax

This compares the result of the above processing (now stored in eax) with
the int we started with (still at offset 0x14 from stack). The cmp
instruction sets flags ready for the jne instruction below.

 8048450:       75 16                   jne    8048468 <main+0x4b>

JNE (jump if not equal) - this is the branching instruction. If the
previously compared values are unequal we skip forward to 8048468 otherwise
we execute the instructions below which call printf and jmp (break) out of
the loop.

 8048452:       8b 44 24 14             mov    0x14(%esp),%eax
 8048456:       89 44 24 04             mov    %eax,0x4(%esp)
 804845a:       c7 04 24 10 85 04 08    movl   $0x8048510,(%esp)
 8048461:       e8 8a fe ff ff          call   80482f0 <printf at plt>
 8048466:       eb 0f                   jmp    8048477 <main+0x5a>

This is where the jne goes. We increment the int and jump (jle) back to the
top of the loop (unless the loop is finished).

 8048468:       83 44 24 14 01          addl   $0x1,0x14(%esp)
 804846d:       81 7c 24 14 3f 42 0f    cmpl   $0xf423f,0x14(%esp)
 8048474:       00
 8048475:       7e c3                   jle    804843a <main+0x1d>
...

My suspicion falls on the FSTPL instruction. The rounding behaviour there
depends on the rounding flags set in the x87 control word which have not
been explicitly controlled here. The calculation we want to perform cannot
be performed exactly in 64 bit floating point:

>>> from fractions import Fraction as F
>>> x = 1-.5**53
>>> F(x)*F(2049) == F(x*2049)
False

This is because the true result cannot be represented as a 64 bit float.
Instead the result comes out to the nearest available float:

>>> float(F(x)*F(2049)) == x*2049
True

This means that the x87 register will be storing a higher precision result
in its 80 bit format. This result will have to be rounded by the FSTPL
instruction.

If you look at the assembly output I showed you'll see the instructions
FNSTCW/FLDCW (x87 store/load control word) which are used to manipulate the
control word to tell the FPU how to perform this kind of rounding. The fact
that we don't see it in your compiled output could indicate a C compiler
bug which could in turn explain the different behaviour people see from
Python.

To understand exactly why 2049 is the number where it fails consider that
it is the smallest integer that requires 12 bits of mantissa in floating
point format. The number 1-.5**53 has a mantissa that is 53 ones:

>>> x = 1-.5**53
>>> x.hex()
'0x1.fffffffffffffp-1'

When extended to 80 bit real-extended format with a 64 bit mantissa it will
have 11 trailing zeros. So I think multiplication of x with any integer
less than 2049 can be performed exactly by the FMULL instruction. I haven't
fully considered what impact that would have but it seems reasonable that
this is why 2049 is the first number that fails.

--
Oscar
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/python-list/attachments/20150706/22f53dc4/attachment.html>