Unexpected behaviour of math.floor, round and int functions (rounding)

Avi Gross avigross at verizon.net
Sat Nov 20 21:36:35 EST 2021


Chris,

You know I fully agree with you that, within some bounds, any combination of numbers that can be represented exactly will continue to be represented exactly under operations like addition, subtraction and multiplication, up to the point where they overflow (or underflow) the storage mechanism.
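For example, floats hold every integer exactly up to 2**53; past that, the gap between adjacent floats exceeds 1 and an addition can silently lose information:

>>> 2.0 ** 53
9007199254740992.0
>>> 2.0 ** 53 + 1.0 == 2.0 ** 53
True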

Division may be problematic and especially division by zero. 
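Both hazards are easy to demonstrate:

>>> 1 / 3        # two exactly representable inputs, an inexact quotient
0.3333333333333333
>>> 1.0 / 0.0
Traceback (most recent call last):
  ...
ZeroDivisionError: float division by zero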

But bring in any number that is not fully and accurately representable, and it can poison everything, much the way an NA poisons any attempt to take a sum or a mean. Any calculation that includes a number like e is an example.
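A tiny demonstration of that poisoning, using 0.1 (which, like e, has no exact binary form):

>>> 0.1 + 0.2 == 0.3
False
>>> 0.1 + 0.2
0.30000000000000004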

Of course, much in computing does not rely on exactly representable numbers, especially when the numbers arrive dynamically, as from a file or a user, and are already not quite representable. I can even imagine a situation where some fraction is represented in a double and then "copied" into a single-precision float, so part of it is lost/truncated.
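That narrowing is easy to see with the standard struct module, squeezing a double through a 32-bit float and back:

>>> import struct
>>> x = 0.1                                          # a 64-bit double
>>> (y,) = struct.unpack('f', struct.pack('f', x))   # round-trip via 32 bits
>>> x
0.1
>>> y
0.10000000149011612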

I get your point about URLs, but at that point I was really focused on filenames as an example, on systems where they are not case sensitive. Some programming languages have had a similar concept. Yes, URLs can need more complex comparison functions, including when something lengthens them or whatever. In one weird sense, as in you GET TO THE SAME PAGE, any URL that redirects you to another might be considered synonymous, even if the two look nothing at all alike.

To continue, I do not mean to give the impression that comparing representable numbers with == is generally wrong. I am saying there are places where there may be good reasons for the alternative.

I can imagine an algorithm that starts with representable numbers and maybe at each stage continues to generate representable numbers, such as one of the hill-climbing algorithms I am talking about. It may end up overshooting a bit past the peak and then, on the next round, overshooting back to the other side, getting stuck in a loop. One way out is to keep track of past locations and abort when the cycle is seen to repeat. Another is to leave when the result seems close enough, as in the sketch below.
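A minimal sketch of that close-enough exit, with illustrative names and tolerances of my own choosing:

from math import isclose

def hill_climb(f, x, step=0.5, tol=1e-9, max_iter=10000):
    # Toy 1-D hill climber: step uphill while we can, halve the step
    # when both directions overshoot, and leave on a "close enough"
    # test rather than demanding an exact landing on the peak.
    for _ in range(max_iter):
        if f(x + step) > f(x):
            nxt = x + step
        elif f(x - step) > f(x):
            nxt = x - step
        else:
            step /= 2          # overshot both ways: shrink the step
            nxt = x
        if step < tol and isclose(nxt, x, rel_tol=0.0, abs_tol=tol):
            return x
        x = nxt
    return x

print(hill_climb(lambda t: -(t - 3.2) ** 2, x=0.0))   # ~3.2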

However, my comments about over/underflow may apply here as enough iterations with representable numbers may at some point result in the kind of rounding error that warps the results of further calculations.

I note some of your argument is the valid difference between when your knowledge of the input numbers is uncertain and what the computer does with them. Yes, my measures of the height/width/depth may be uncertain, and it is not the fault of a Python program if it multiplies them to provide an exact answer, as if in a mathematical world where numbers are normally precise. I am saying that the human using the program needs external info before they use the answer. In my example, I would note the rule that when dealing with numbers that are only significant to some number of digits, the final result should often be rounded to fewer digits according to some rule. So instead of printing out the volume as 6181.806, the program may call a function like round(), as in round(10.1*20.2*30.3, 1), so it displays 6181.8 instead. The Python language does what you ask and not what you do not ask.
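For instance:

>>> volume = 10.1 * 20.2 * 30.3   # mathematically 6181.806
>>> round(volume, 1)
6181.8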

Now a statistical program, or perhaps an AI or machine-learning program I write, might actually care about the probabilistic effects. I often create graphs that include a smoothed curve of some kind that approximates the points in the data, as well as a light gray ribbon representing error bands above and below. The ribbon suggests the line not be taken too seriously: there may be something like a 95% chance the true values are within the gray zone, and even some chance they lie beyond it, in still lighter gray (color is not the issue) zones representing a 1% chance or less.

Such approaches apply if the measurement errors are assumed to be as much as 0.1 inches for each measure, independently. The smallest possible volume would be:

(10.1 - 0.1)*(20.2 - 0.1)*(30.3 - 0.1) = 6070.2

The largest possible volume if all my measures were off by that amount in the other direction would be:

(10.1 + 0.1)*(20.2 + 0.1)*(30.3 + 0.1) = 6294.6 

The above results are rounded to one decimal place.
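The same bracket, checked in the interpreter:

>>> lo = (10.1 - 0.1) * (20.2 - 0.1) * (30.3 - 0.1)
>>> hi = (10.1 + 0.1) * (20.2 + 0.1) * (30.3 + 0.1)
>>> print(f"{lo:.1f} to {hi:.1f}")
6070.2 to 6294.6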

The Python program evaluates all the above numbers as exactly as it can, albeit I doubt they are all exactly representable in binary. But for human purposes, the actual answer for a volume has some uncertainty built in to the method of measurement, and perhaps to other things: the inner sides of the box may not be perfectly flat, the angles things join at may not be precisely 90 degrees, and filling it with something like oranges may not fit much more if you enlarge it a tad, as they may not stack much better after a minor change.

Python is not to blame in these cases if the program is not written carefully enough. And I suggest that the often minor errors introduced by a representation being not quite right in the last few available places (binary underneath the decimal display) may be much smaller than the other errors introduced by the reality of such a superficially simple calculation.

From my side of this discussion, I do not see much we basically disagree on, albeit we may word some ideas differently. I think I may opt out of further comments unless something new is mentioned that relates to Python or similar programming languages. I am spending too much time today on this one and on another in an R mailing list, plus other things elsewhere, so my main plans for the day have fallen behind 😉

Avi

-----Original Message-----
From: Python-list <python-list-bounces+avigross=verizon.net at python.org> On Behalf Of Chris Angelico
Sent: Saturday, November 20, 2021 6:23 PM
To: python-list at python.org
Subject: Re: Unexpected behaviour of math.floor, round and int functions (rounding)

On Sun, Nov 21, 2021 at 10:01 AM Avi Gross via Python-list <python-list at python.org> wrote:
> Computers generally use finite methods, sometimes too finite. Yes, the 
> problem is not Mathematics as a field. It is how humans often 
> generalize or analogize from one area into something a bit different. 
> I do not agree with any suggestion that a series of bits that encodes 
> a result that is rounded or truncated is CORRECT. A representation of 
> 0.3 in a binary version of some floating point format is not 
> technically correct. Storing it as 3/10 and carefully later 
> multiplying it by 20 and then carefully canceling part will result in 
> exactly 6. While storing it digitally and then multiplying it in 
> registers or whatever by 20 may get a result slightly different than 
> the storage representation of 6.0000000000... and that is a fact and risk we generally are willing to take.

Do you accept that storing the floating point value 1/4, then multiplying by 20, will give precisely 5? Because that is *guaranteed*. You don't have to expect a result "slightly different"
from 5, it will be absolutely exactly five:

>>> (1/4) * 20 == 5.0
True

This is what I'm talking about. Some numbers can be represented perfectly, others can't. If you try to represent the square root of two as a decimal number, then multiply it by itself, you won't get back precisely 2, because you can't have written out the *exact* square root of two. But you most certainly CAN write "1.875" on a piece of paper, and it really truly does exactly mean fifteen eighths.
And you can write that number as a binary float, too, and it'll mean the exact same value.
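For instance:

>>> import math
>>> math.sqrt(2) ** 2     # the approximation, squared, misses exactly 2
2.0000000000000004
>>> 1.875 == 15/8         # but 1.875 is exactly fifteen eighths
True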

> But consider a different example. If I have a filesystem or URL or 
> anything that does not care about whether parts are in upper or lower 
> case, then "filename" and "FILENAME" and many variations like 
> "fIlEnAmE" are all assumed to mean the same thing. A program may even 
> simply store all of them in the same way as all uppercase. But when 
> you ask to compare two versions with a function where case matters, 
> they all test as unequal! So there are ways to ask for a comparison 
> that is approximately equal given the constraints that case does not matter:

A URL has distinct parts to it: the domain has some precise folding done (most notably case folding), the path does not, and you can consider "http://example.com:80/foo" to be the same as "http://example.com/foo" because 80 is the default port.
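A rough sketch of that kind of comparison with urllib.parse (illustrative only, not a complete normalizer):

>>> from urllib.parse import urlsplit
>>> def url_key(u):
...     p = urlsplit(u)
...     # fold the case-insensitive parts; fill in the default port
...     return (p.scheme.lower(), p.hostname, p.port or 80, p.path)
...
>>> url_key("http://EXAMPLE.com:80/foo") == url_key("http://example.com/foo")
True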

> >>> alpha="Hello"
> >>> beta="hELLO"
> >>> alpha == beta
> False
> >>> alpha.lower() == beta.lower()
> True
>

That's a terrible way to compare URLs, because it's both too sloppy AND too strict at the same time. But if you have a URL representation tool, it should be able to decide when two URLs count as equal.

Floats are representations of numbers that can be compared for equality if they truly represent the same number. The value 3/6 is precisely equal to the value 7/14:

>>> 3/6 == 7/14
True

You don't need an "approximately equal" function here. They are the same value. They are equal.

> I see no reason why a comparison cannot be done like this in cases you
> are concerned with small errors creeping in:
>
> >>> from math import isclose
> >>> isclose(1, .9999999999999999999999)
> True
> >>> isclose(1, .9999999999)
> True
> >>> isclose(1, .999)
> False

This is exactly the problem though: HOW close counts as equal? The only way to answer that question is to know the accuracy of your inputs, and the operations done.
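The tolerance has to come from your knowledge of the data, not from the float format:

>>> from math import isclose
>>> isclose(100.0, 100.1)                 # default rel_tol=1e-09
False
>>> isclose(100.0, 100.1, abs_tol=0.25)   # "accurate to a quarter unit"
True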

> So floats by themselves are not inaccurate but realistically the 
> results of operations ARE. I mean if I ask a long number to be stored 
> that does not fully fit, it is often silently truncated and what the 
> storage location now represents accurately is not my number but the
> shorter version that is at the limit of tolerance. But consider 
> another analogy often encountered in mathematics.

Not true. Operations are often perfectly accurate.

> If I measure several numbers in the real world such as weight and 
> height and temperature and so on, some are considered accurate only to 
> a limited number of digits. Your weight on a standard digital scale 
> may well be 189.8 but if I add a feather or subtract one, the reading 
> may well shift to one unit up or down. Heck, the same person measured 
> just minutes later may shift. If I used a deluxe scale that measures 
> to more decimal places, it may get hard to get the exact same number 
> twice in a row as just taking a deeper breath may make a change.
>
> So what happens if I measure a box in three dimensions to the nearest 
> .1 inch and decide it is 10.1 by 20.2 by 30.3 inches? What is the 
> volume, ignoring pesky details about the width of the cardboard or whatever?
>
> A straightforward multiplication yields 6181.806 cubic inches. You may 
> have been told to round that to something like 6181.8 because the 
> potential error in each measure cannot result in more precision. In 
> reality, you might even calculate two sets of numbers assuming the 
> true width may have been a tad more or less and come up with the 
> volume being BETWEEN a somewhat smaller number and a somewhat larger number.

If those initial figures were accurate to three digits, you should round it to 6180 cubic inches, because that's all the accuracy you have. (Or, if you prefer, 6180 +/- 5.)

> I claim a similar issue plagues using a computer to deal with stored 
> numbers, perhaps not stored 100% perfectly as discussed, and doing 
> calculations. The result often comes out more precisely than 
> warranted. I suspect there are modules out there that might do 
> multi-step calculations where at each step, numbers generated with 
> extra precision are throttled back so the extra precision is set to 
> zeroes after rounding to avoid the small increments adding up. Others 
> may just do the calculations and keep track and remove extra precision at the end.

When your input values aren't accurate, your output won't be accurate.
That's something the computer can never know. When you store the number 3602879701896397/36028797018963968, did you actually mean that number, or did you mean some other number that's kinda close to it? If you don't tell the computer, it's going to assume that you wanted exactly that number.
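(You can ask Python for that exact stored value directly:)

>>> from fractions import Fraction
>>> Fraction(0.1)
Fraction(3602879701896397, 36028797018963968)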

> And again, this is not because the implementation of numbers is in any 
> way wrong but because a real-world situation requires the humans to 
> sort of dial back how they are used and not over-reach.
>
> So comparing for close-enough inequality is not necessarily a 
> reflection on floats but on the design not accommodating the precision 
> needed or perhaps on the algorithm used not necessarily being expected 
> to reach a certain level.

And close-enough equality is the correct thing to do when you know exactly what the accuracy of your inputs is. If you need to be completely rigorous about it, you'd have to store every number as a range (so you might say that your input length is "10.05 to 10.15" or "10.1, error 0.5") and do all arithmetic on those ranges. What you'd find is that some operations widen the ranges and others don't. The trouble is, that's not actually all that useful; Fermi estimates are far more accurate than they seem like they "should be" because the balance of probability is in favour of errors cancelling out, at least partially.
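A minimal sketch of that range idea, as a toy Interval class (my names, not a real library; a full version would handle negatives, division, and so on):

from dataclasses import dataclass

@dataclass
class Interval:
    lo: float
    hi: float
    def __mul__(self, other):
        # Take min/max over all endpoint products so the result
        # brackets every value the true product could take.
        p = (self.lo * other.lo, self.lo * other.hi,
             self.hi * other.lo, self.hi * other.hi)
        return Interval(min(p), max(p))

length = Interval(10.05, 10.15)   # "10.1, error 0.05 either way"
width  = Interval(20.15, 20.25)
depth  = Interval(30.25, 30.35)
print(length * width * depth)     # roughly Interval(lo=6125.85, hi=6238.06)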

ChrisA
--
https://mail.python.org/mailman/listinfo/python-list


