[SciPy-Dev] distributions.py

Sun Sep 16 14:39:25 EDT 2012

On Sat, Sep 15, 2012 at 5:03 PM, nicky van foreest <vanforeest at gmail.com> wrote:
> Hi,
>
> While reading distributions.py I made a kind of private trac list, of
> stuff that might need refactoring, As a matter of fact, all issues
> discussed in the mails above are already on my list. To summarize
> (Please don't take the list below as a complaint, but just factual. I
> am very happy that all this exists.)
>
> 1: the documentation is not clear, too concise, and fragmented;
> actually a bit messy.
>
> 2: there is code overlap in the check work (The lines Ralf mentioned)
> making it hard to find out the differences (but the differences in the
> check work are method dependent so I don't quite know how to tackle
> that in an elegant way),

I looked in the past how the conditions are build, and I gave up
trying to unify them  after a short time.
pdf is zero outside of support
cdf, sf is zero or one outside of support
ppf, isf produces nan if not in [0,1]

boundary points are either included or treated explicitly
all produce nan if shape parameter is invalid.

reading the conditions for all corner cases might cause headaches :)

>
> 3: the docs say that _argscheck need to be rewritten in case users
> build their own distribution. But then the minimal requirement in my
> opinion is that argscheck is simple to understand, and not overly
> generic as it is right now. (I also have examples that its output,
> while in line with its doc string, results in errors.) As far as I can
> see its core can simply be replaced by np.all(cond) (I did not test
> this though).

np.all(cond)  will not work

from code comment:
"Returns condition array of 1's where arguments are correct and 0's
where they are not."

_argcheck is *elementwise* check for valid parameters
furthermore, in some cases _argcheck needs to set a, b if those depend
on shape parameters.

no ``def __init__()``

Josef

>
> 4: distributions.py is very big, too big for me actually. I recall
> that my first attempt at finding out how the stats stuff worked was to
> see how expon was implemented. No clue that this resided in
> distributions.py.
>
> What I would like to see, although that would require a considerable
> amount of work, is an architecture like this.
> 1 rv_generic.py containing generic stuff
> 2) rv_continous.py and rv_discrete.py, each imports rv_generic.
> 3) each distribution is covered in a separate file. like expon.py,
> norm, py, etc, and imports rv_continuous.py or rv_discrete.py,
> whatever appropriate. Each docstring can/should contain some generic
> part (like now) and a specific part, with working examples, and clear
> explanations. The most important are normal, expon, binom, geom,
> poisson, and perhaps some others. This would also enable others to
> help extend the documentation, examples....
> 4) I would like to move the math parts in continuous.rst to the doc
> string in the related distribution file.  Since mathjax gives such
> nice results on screen, there is also no reason not to include the
> mathematical facts in the doc string of the distribution itself. In
> fact, most (all?) distributions already have a short math description,
> but this is in overlap with continuous.rst.
>
> I wouldn't mind chopping up distributions.py into the separate
> distributions, and merge it with the maths of continuous.rst. I can
> tackle approx one distribution per day roughly, hence reduce this
> mind-numbing work to roughly 15 minutes a day (correction work on
> exams is much worse :-) ). But I don't know how much this proposal
> will affect the automatic generation of documentation. For the rest I
> don't think this will affect the code a lot.
>
>
>
> NIcky
>
>
>
>
>
> On 15 September 2012 11:59, Ralf Gommers <ralf.gommers at gmail.com> wrote:
>>
>>
>> On Fri, Sep 14, 2012 at 10:56 PM, Jake Vanderplas
>> <vanderplas at astro.washington.edu> wrote:
>>>
>>> On 09/14/2012 01:49 PM, Ralf Gommers wrote:
>>>
>>>
>>>
>>> On Fri, Sep 14, 2012 at 12:48 AM, <josef.pktd at gmail.com> wrote:
>>>>
>>>> On Thu, Sep 13, 2012 at 5:21 PM, nicky van foreest <vanforeest at gmail.com>
>>>> wrote:
>>>> > Hi,
>>>> >
>>>> > Now that I understand github (Thanks to Ralf for his explanations in
>>>> > Dutch) and got some simple stuff out of the way in distributions.py I
>>>> > would like to tackle a somewhat harder issue. The function argsreduce
>>>> > is, as far as I can see, too generic. I did some tests to see whether
>>>> > its most generic output, as described by its docstring, is actually
>>>> > swallowed by the callers of argsreduce, but this appears not to be the
>>>> > case.
>>>>
>>>> being generic is not a disadvantage (per se) if it's fast
>>>>
>>>> https://github.com/scipy/scipy/commit/4abdc10487d453b56f761598e8e013816b01a665
>>>> (and a being a one liner is not a disadvantage either)
>>>>
>>>> Josef
>>>>
>>>> >
>>>> > My motivation to simplify the code in distributions.py (and clean it
>>>> > up) is partly based on making it simpler to understand for myself, but
>>>> > also to  others. The fact that github makes code browsing a much nicer
>>>> > experience, perhaps more people will take a look at what's under the
>>>> > hood. But then the code should also be accessible and clean. Are there
>>>> > any reasons not to pursue this path, and focus on more important
>>>> > problems of the stats library?
>>>
>>>
>>> Not sure that argsreduce is the best place to start (see Josef's reply),
>>> but there should be things that can be done to make the code easier to read.
>>> For example, this code is used in ~10 methods of rv_continuous:
>>>
>>>         loc,scale=map(kwds.get,['loc','scale'])
>>>         args, loc, scale = self._fix_loc_scale(args, loc, scale)
>>>         x,loc,scale = map(asarray,(x,loc,scale))
>>>         args = tuple(map(asarray,args))
>>>
>>> Some refactoring may be in order. The same is true of the rest of the
>>> implementation of many of those methods. Some are exactly the same except
>>> for calls to the corresponding underscored method (example: logsf() and
>>> logcdf() are identical except for calls to _logsf() and _logcdf(), and one
>>> nonsensical multiplication).
>>>
>>> Ralf
>>>
>>>
>>>
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>
>>> I would say that the most important improvement needed in distributions is
>>> in the documentation.
>>>
>>> A new user would look at the doc string of, say, scipy.stats.norm, and
>>> have no idea how to proceed.  Here's the current example from the docstring
>>> of scipy.stats.norm:
>>>
>>> Examples
>>> --------
>>> >>> from scipy.stats import norm
>>> >>> numargs = norm.numargs
>>> >>> [  ] = [0.9,] * numargs
>>> >>> rv = norm()
>>>
>>> >>> x = np.linspace(0, np.minimum(rv.dist.b, 3))
>>> >>> h = plt.plot(x, rv.pdf(x))
>>>
>>> I don't even know what that means... and it doesn't compile.  Also, what
>>> is b?  how would I enter mu and sigma to make a normal distribution?  It's
>>> all pretty opaque.
>>
>>
>> True, the examples are confusing. The reason is that they're generated from
>> a template, and it's pretty much impossible to get clear and concise
>> examples that way. It would be better to write custom examples for the
>> most-used distributions, and refer to those from the others.
>>
>> Ralf
>>
>>
>>
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev