[SciPy-user] Polyfit may be poorly conditioned

Fri Mar 28 12:36:25 EDT 2008

On 28/03/2008, Xavier Gnata <xavier.gnata at gmail.com> wrote:

> Hum we have a problem here:
> Polyfit is a linear fit. It can (should?) be implemented using using the
> pseudo-inverse paradigm and it "pseudo-inverse provide you with the solution
> that minimize the chi2. As a result, a 12 ordrer fit cannot be worst than a
> 5 order. It should at least end up with the same chi2.
>
> One trivial exemple :
> In [9]: polyfit(arange(1000),2*arange(1000)+1,30)
> C:\Python25\lib\site-packages\numpy\lib\polynomial.py:306:
> RankWarning:
> Polyfit
> may be poorly conditioned
>  warnings.warn(msg, RankWarning)
>
> Out[9]:
> array([ 4.25295067e-93, -1.90319088e-89, 2.47204987e-86,
>  3.64986075e-84, -2.04058052e-80, -1.02156478e-77,
>  1.38333948e-74, 1.84191780e-71, -1.04858370e-69,
>  -1.93210697e-65, -1.30453334e-62, 1.07062138e-59,
>
>  2.11173521e-56, 2.00842417e-54, -2.16681833e-50,
>  -1.21709485e-47, 1.99280602e-44, 1.67639356e-41,
>  -2.35550438e-38, -1.10826788e-35, 3.53529389e-32,
>  -2.99768374e-29, 1.44456488e-26, -4.49042306e-24,
>
>  9.31940018e-22, -1.28887179e-19, 1.15776067e-17,
>  -6.43505943e-16, 2.04266284e-14, 2.00000000e+00,
>  1.00000000e+00])
> Of course, we also have numerical errors but if they cannot be neglected,
> then to have found a problem which is really ill-conditioned (hudge range of
> values ? values very close to inf or 0 ?)...or there is a pb in poyfit but I
> cannot see this bug.
>
> Any comments?

What exactly is the problem here? This is the best fit that could be
hoped for - the linear and constant terms are right, and the rest are
zero to the best accuracy we can plausibly expect. If you are using a
goodness-of-fit measure that takes into account the number of
parameters fitted (Bayesian model comparison, say, or even a
chi-squared that takes into account "degrees of freedom") this should
be reported as a much worse fit than using only a linear polynomial.

The problem is ill-conditioned because 1000**30 is so much bigger than
999**30, as is 1000**29 compared to 999**29, that the vectors
arange(1000)**30 and arange(1000)**30 are nearly identical. As the
polyfit docstring says, this is a really horrible basis to use for
polynomial fitting.

Since polyfit is quite sensibly based on the SVD, when it encounters
some linear combination of basis vectors that nearly cancels, rather
than introduce enormous coefficients to try to use this combination
for fitting, it just discards it. This results in slightly poorer fits
sometimes, but drastically reduces the numerical headaches that come
with ill-conditioned matrices.  Thus even if a higher-degree
polynomial would fit the data better in a world of exact arithmetic,
occasionally numerical issues force the discarding of some troublesome
combinations of coefficients.

I do think it would be useful to have some mechanism for fitting and
working with polynomials represented in other bases, so that these
numerical issues would be reduced. This could go through at the same
time as an improvement (by specialization) of some of scipy's
orthogonal polynomial functions. But for now I can see no real problem
with polyfit.

Anne