[issue39218] Assertion failure when calling statistics.variance() on a float32 Numpy array
Mark Dickinson
report at bugs.python.org
Thu Aug 26 04:38:33 EDT 2021
Mark Dickinson <dickinsm at gmail.com> added the comment:
> what it's correcting for is an inaccurate value of "c" [...]
In more detail:
Suppose "m" is the true mean of the x in data, but all we have is an approximate mean "c" to work with. Write "e" for the error in that approximation, so that c = m + e. Then (using Python notation, but treating the expressions as exact mathematical expressions computed in the reals):
    sum((x - c)**2 for x in data)
      == sum((x - m - e)**2 for x in data)
      == sum((x - m)**2 for x in data) - 2 * sum((x - m)*e for x in data)
           + sum(e**2 for x in data)
      == sum((x - m)**2 for x in data) - 2 * e * sum((x - m) for x in data)
           + sum(e**2 for x in data)
      == sum((x - m)**2 for x in data) + sum(e**2 for x in data)
           (because sum((x - m) for x in data) is 0)
      == sum((x - m)**2 for x in data) + n*e**2
So the error in our result that arises from using the approximate mean c in place of m is exactly that n*e**2 term. And that's the term that's being subtracted here, because
    sum(x - c for x in data) ** 2 / n
      == sum(x - m - e for x in data) ** 2 / n
      == (sum(x - m for x in data) - sum(e for x in data))**2 / n
      == (0 - n * e)**2 / n
      == n * e**2
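
Both identities can be checked numerically with exact rational arithmetic. The following is only an illustrative sketch, not the code in the statistics module: the data values and the error e = 1/10 are made up for the example, and Fraction is used so that the expressions behave like the exact real-number expressions above.

```python
from fractions import Fraction

# Hypothetical sample data and a hypothetical error in the mean.
data = [Fraction(1), Fraction(2), Fraction(4), Fraction(7)]
n = len(data)

m = sum(data) / n        # true mean, computed exactly
e = Fraction(1, 10)      # assumed error in the approximate mean
c = m + e                # approximate mean that we pretend to have

# Sum of squared deviations about the approximate mean c.
naive = sum((x - c) ** 2 for x in data)

# Sum of squared deviations about the true mean m.
exact = sum((x - m) ** 2 for x in data)

# The correction term that gets subtracted.
correction = sum(x - c for x in data) ** 2 / n

assert naive == exact + n * e ** 2      # first derivation
assert correction == n * e ** 2         # second derivation
assert naive - correction == exact      # correction removes the error
```

With floats instead of Fraction the same relations hold only approximately, since the subtractions and sums are themselves rounded.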
----------
_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue39218>
_______________________________________