[Numpy-discussion] pareto docstring

Mon May 10 23:37:20 EDT 2010

On Mon, May 10, 2010 at 2:14 PM, T J <tjhnson at gmail.com> wrote:
> On Sun, May 9, 2010 at 4:49 AM,  <josef.pktd at gmail.com> wrote:
>>
>> I think this is the same point, I was trying to make last year.
>>
>> Instead of renormalizing, my conclusion was the following,
>> (copied from the mailinglist August last year)
>>
>> """
>> my conclusion:
>> ---------------------
>> What numpy.random.pareto actually produces, are random numbers from a
>> pareto distribution with lower bound m=1, but location parameter
>> loc=-1, that shifts the distribution to the left.
>>
>> To actually get useful  random numbers (that are correct in the usual
>> usage http://en.wikipedia.org/wiki/Pareto_distribution), we need to
>> add 1 to them.
>> stats.distributions doesn't use mtrand.pareto
>>
>> rvs_pareto = 1 + numpy.random.pareto(a, size)
>>
>> """
>>
>> I still have to work though the math of your argument, but maybe we
>> can come to an agreement how the docstrings (or the function) should
>> be changed, and what numpy.random.pareto really means.
>>
>> Josef
>> (grateful, that there are another set of eyes on this)
>>
>>
>
>
> Yes, I think my "renormalizing" statement is incorrect as it is really
> just sampling from a different pdf altogether.  See the following image:
>
> http://www.dumpt.com/img/viewer.php?file=q9tfk7ehxsw865vn067c.png
>
> It plots histograms of the various implementations against the pdfs.
> Summarizing:
>
> The NumPy implementation is based on (Devroye p. 262).  The pdf listed
> there is:
>
>    a / (1+x)^(a+1)
>
> This differs from the "standard" Pareto pdf:
>
>    a / x^(a+1)
>
> It also differs from the pdf of the generalized Pareto distribution,
> with scale=1 and location=0:
>
>    (1 + a x)^(-1/a - 1)
>
> And it also differs from the pdf of the generalized Pareto
> distribution with scale=1 and location=-1  or location=1.
>
> random.paretovariate and scipy.stats.pareto sample from the standard
> Pareto, and this is the desired behavior, IMO.  Its true that "1 +
> np.random.pareto" provides the fix, but I think we're better off
> changing the underlying implementation.  Devroye has a more recent
> paper:
>
>  http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.85.8760
>
> which states the Pareto distribution in the standard way.  So I think
> it is safe to make this change.  Backwards compatibility might be the
> only argument for not making this change.  So here is my proposal:
>
>  1) Remove every mention of the generalized Pareto distribution from
> the docstring.  As far as I can see, the generalized Pareto
> distribution does not reduce to the "standard" Pareto at all.  We can
> still mention scipy.stats.distributions.genpareto and
> scipy.stats.distributions.pareto.  The former is something different
> and the latter will (now) be equivalent to the NumPy function.
>
>  2) Modify numpy/random/mtrand/distributions.c in the following way:
>
> double rk_pareto(rk_state *state, double a)
> {
>   //return exp(rk_standard_exponential(state)/a) - 1;
>   return 1.0 / rk_double(state)**(1.0 / a);
> }
>
> Does this sound good?

I went googling and found a new interpretation

numpy.random.pareto is actually the Lomax distribution also known as Pareto 2,
Pareto (II) or Pareto Second Kind distribution

http://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/pa2pdf.htm
http://www.mathwave.com/articles/pareto_2_lomax_distribution.html
http://hosho.ees.hokudai.ac.jp/~kubo/Rdoc/library/VGAM/html/lomax.html

which is different from the Pareto (First Kind) distribution
http://www.mathwave.com/help/easyfit/html/analyses/distributions/pareto.html
http://en.wikipedia.org/wiki/Pareto_distribution#Density_function

The R-VGAM docs have the reference to
http://books.google.ca/books?id=pGsAZ3W7uEAC&pg=PA226&lpg=PA226&dq=Kleiber,+C.+and+Kotz,+S.+%282003%29+Statistical+Size+Distributions+in+Economics+and+Actuarial+Sciences+pareto+lomax&source=bl&ots=j1-AoRxm5E&sig=KDrWehJW5kt-EKH1VjDe-lFMRpw&hl=en&ei=csPoS_bgDYT58Aau46z2Cg&sa=X&oi=book_result&ct=result&resnum=1&ved=0CBUQ6AEwAA#v=onepage&q&f=false

which states on page 227

X is distributed as Lomax(b,q) <=> X + b is distributed Pareto(b, q)

where b is the scale parameter b=1 in numpy.random.pareto notation
and q is the shape parameter alpha

quote:
"... the Pareto (II) distribution - after all, it's just a shifted
classical Pareto distribution - ..."

So, from this it looks like numpy.random does not have a Pareto
distribution, only Lomax, and the confusion came maybe because
somewhere in the history the (II) (second kind) got dropped in the
explanations.

and actually it is in scipy.stats.distributions, but without rvs

# LOMAX (Pareto of the second kind.)
#  Special case of Pareto of the first kind (location=-1.0)

class lomax_gen(rv_continuous):
    def _pdf(self, x, c):
        return c*1.0/(1.0+x)**(c+1.0)
    def _cdf(self, x, c):
        return 1.0-1.0/(1.0+x)**c
    def _ppf(self, q, c):
        return pow(1.0-q,-1.0/c)-1
    def _stats(self, c):
        mu, mu2, g1, g2 = pareto.stats(c, loc=-1.0, moments='mvsk')
        return mu, mu2, g1, g2
    def _entropy(self, c):
        return 1+1.0/c-log(c)
lomax = lomax_gen(a=0.0, name="lomax",
                  longname="A Lomax (Pareto of the second kind)",
                  shapes="c", extradoc="""

Lomax (Pareto of the second kind) distribution

lomax.pdf(x,c) = c / (1+x)**(c+1)
for x >= 0, c > 0.
"""

There are too many distribution, and too many different names.

>So here is my proposal:
>
>  1) Remove every mention of the generalized Pareto distribution from
> the docstring.  As far as I can see, the generalized Pareto
> distribution does not reduce to the "standard" Pareto at all.  We can
> still mention scipy.stats.distributions.genpareto and
> scipy.stats.distributions.pareto.  The former is something different
> and the latter will (now) be equivalent to the NumPy function.

agreed, although I haven't figured out yet why Pareto and generalized
Pareto have similar names

>
>  2) Modify numpy/random/mtrand/distributions.c in the following way:
>
> double rk_pareto(rk_state *state, double a)
> {
>   //return exp(rk_standard_exponential(state)/a) - 1;
>   return 1.0 / rk_double(state)**(1.0 / a);
> }

I'm not an expert on random number generator, but using the uniform distribution
as in
http://en.wikipedia.org/wiki/Pareto_distribution#Generating_a_random_sample_from_Pareto_distribution
and your Devroy reference seems better, than based on the relationship to
the exponential distribution
http://en.wikipedia.org/wiki/Pareto_distribution#Relation_to_the_exponential_distribution

I think without changing the source we can rewrite the docstring that
this is Lomax (or
Pareto of the Second Kind), so that at least the documentation is less
misleading.

But I find calling it Pareto very confusing, and I'm not the only one anymore,
(and I don't know if anyone has used it assuming it is classical Pareto),
so my preferred solution would be

* rename numpy.random.pareto to numpy.random.lomax
* and create a real (classical, first kind) pareto distribution (even
though it's just
  adding or subtracting 1, ones we know it)

(and I'm adding the _rvs to scipy.stats.lomax)

What's the backwards compatibility policy with very confusing names in numpy?

Josef

> _______________________________________________
> NumPy-Discussion mailing list
> NumPy-Discussion at scipy.org
> http://mail.scipy.org/mailman/listinfo/numpy-discussion
>