[SciPy-User] Scipy's probplot compared to R's qqplot

josef.pktd at gmail.com
Wed Mar 3 15:08:00 EST 2010


On Wed, Mar 3, 2010 at 2:49 PM,  <PHobson at geosyntec.com> wrote:
>> On Wed, Mar 3, 2010 at 2:09 PM,  <PHobson at geosyntec.com> wrote:
>> > Hey folks,
>> >
>> > I've taken more of an interest in statistics and Scipy lately and
>> decided to compare the scipy.stats.probplot() function to R's qqplot().
>> For a given dataset, the results are slightly different.
>> >
>> > Here's a link to the script I wrote to do the comparison.
>> > http://dpaste.com/167464/
>> >
>> > Basically, it does the following:
>> > -Uses numpy to generate some fake, normally distributed data
>> > -Uses both R and Scipy to compute the values needed for
>> a quantile/probability plot
>> > -Computes linear regressions on the quantile data with both R and
>> Scipy.
>> > -prints some output to compare the two
>> >
>> > My initial conclusions:
>> > 1) R's lm(y~x) and scipy.stats.linregress(x,y) yield the same slope and
>> intercept of a linear model. (good)
>> > 2) R and Scipy compute the quantiles of a dataset in slightly different
>> manners (??)
>> >
>> > Any clue as to why the discrepancy in #2 occurs? Would you consider it
>> a big deal?
>
>
>> From: scipy-user-bounces at scipy.org [mailto:scipy-user-bounces at scipy.org]
>> On Behalf Of josef.pktd at gmail.com
>> I would consider any significant deviation a big deal, unless we know
>> that there are differences in the definitions or underlying
>> assumptions.
>>
>> I'm not sure what's going on since I never looked at the details of
>> probplot. However, when I plot the quantiles
>> >>> plt.plot(np.sort(qR))
>> >>> plt.plot(qS[0])
>> >>> plt.show()
>>
>> then the graph looks almost the same except for the first and last point.
>
> Yes. When I plotted them, I could not visually distinguish them (see attached). I forgot to mention that.
>
>> qS[0]-np.sort(qR)
>>
>> differs in the second decimal, except for first and last observation.
>> My guess would be that there are some differences for example in the
>> continuity correction, or similar.
>>
>> The boundary points, however, look suspicious.
>
> Thanks for looking further into this. When I saw that the slopes and intercepts were different, I immediately inspected just the max and min values (laziness, sorry). If I find some time next week, I'll dig around in the source and see if I can't figure out what's happening at those points.

My prime candidate for the second-decimal differences is a difference in
the correction applied to the interior points:

    Ui[1:-1] = (i-0.3175)/(N+0.365)

There are several conventions for the empirical cdf; David Huard posted
a list of them attached to a ticket (?).
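
Many of these conventions belong to the one-parameter family
(i - a)/(n + 1 - 2a), and the interior formula quoted above is the
member with a = 0.3175. A minimal sketch of that family follows; the
function name and the dictionary of conventions are illustrative
assumptions, not taken from scipy or from the list mentioned above.

```python
def plotting_positions(n, a):
    """Return the positions (i - a) / (n + 1 - 2a) for i = 1..n."""
    return [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]


# A few commonly cited choices of `a` (labels are for illustration only).
CONVENTIONS = {
    "Weibull, i/(n+1)": 0.0,
    "Blom, (i-3/8)/(n+1/4)": 0.375,
    "Hazen, (i-1/2)/n": 0.5,
    "interior formula above": 0.3175,
}

if __name__ == "__main__":
    n = 20
    for label, a in CONVENTIONS.items():
        p = plotting_positions(n, a)
        # Print the two smallest positions to show how the conventions differ.
        print(f"{label:25s} p[0]={p[0]:.5f} p[1]={p[1]:.5f}")
```

Every member of the family is symmetric about 0.5, so the conventions
disagree most in the tails.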

There might be another correction for the boundary points that differs:
    Ui[-1] = 0.5**(1.0/N)
    Ui[0] = 1-Ui[-1]
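
To see how much these two corrections matter, here is a rough sketch
that compares them with the rule used by R's ppoints(). It assumes the
R side of the comparison used ppoints() (as qqnorm does), which the
thread does not confirm, and the sample size n = 100 is arbitrary. The
function names are made up for this example, and it uses only the
standard library rather than scipy.

```python
from statistics import NormalDist


def filliben_positions(n):
    """Positions built from the interior and boundary formulas quoted above."""
    u = [(i - 0.3175) / (n + 0.365) for i in range(1, n + 1)]
    u[-1] = 0.5 ** (1.0 / n)   # boundary correction for the largest point
    u[0] = 1.0 - u[-1]         # mirror image for the smallest point
    return u


def r_ppoints(n):
    """R's ppoints(n) default: a = 3/8 if n <= 10, otherwise a = 1/2."""
    a = 0.375 if n <= 10 else 0.5
    return [(i - a) / (n + 1 - 2 * a) for i in range(1, n + 1)]


def normal_quantiles(positions):
    """Map probabilities to standard normal quantiles."""
    inv_cdf = NormalDist().inv_cdf
    return [inv_cdf(p) for p in positions]


if __name__ == "__main__":
    n = 100  # arbitrary sample size, chosen for illustration
    q_scipy_style = normal_quantiles(filliben_positions(n))
    q_r_style = normal_quantiles(r_ppoints(n))
    diffs = [s - r for s, r in zip(q_scipy_style, q_r_style)]
    print("difference at first/last point:", diffs[0], diffs[-1])
    print("largest interior difference:", max(abs(d) for d in diffs[1:-1]))
```

Only the theoretical quantiles differ between the two conventions; the
sorted data values are the same in both plots.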

But for graphical inspection, R and scipy look close enough.

Josef





> -Paul H.