[SciPy-Dev] distributions.py

Sat Sep 15 17:23:36 EDT 2012

On Sat, Sep 15, 2012 at 5:03 PM, nicky van foreest <vanforeest at gmail.com> wrote:
> Hi,
>
> While reading distributions.py I made a kind of private trac list of
> stuff that might need refactoring. As a matter of fact, all issues
> discussed in the mails above are already on my list. To summarize
> (please don't take the list below as a complaint; it is meant to be
> purely factual. I am very happy that all this exists):
>
> 1: the documentation is not clear, too concise, and fragmented;
> actually a bit messy.
>
> 2: there is code overlap in the check work (the lines Ralf mentioned),
> making it hard to find out the differences (but the differences in the
> check work are method dependent, so I don't quite know how to tackle
> that in an elegant way);
>
> 3: the docs say that _argcheck needs to be rewritten in case users
> build their own distribution. But then the minimal requirement, in my
> opinion, is that _argcheck is simple to understand, and not overly
> generic as it is right now. (I also have examples where its output,
> while in line with its docstring, results in errors.) As far as I can
> see its core can simply be replaced by np.all(cond) (I did not test
> this, though).
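
For illustration only, here is a minimal sketch of the difference between an elementwise check and a check collapsed with np.all. The function names are hypothetical, and the elementwise version only assumes that the default check tests each shape parameter for positivity; it is not copied from distributions.py.

```python
import numpy as np


def argcheck_elementwise(*args):
    # Hypothetical elementwise check: True wherever every shape
    # parameter is positive. The result broadcasts over the inputs.
    cond = True
    for arg in args:
        cond = np.logical_and(cond, np.asarray(arg) > 0)
    return cond


def argcheck_all(*args):
    # Hypothetical collapsed check: a single bool that is True only if
    # every entry of every shape parameter is positive.
    return all(bool(np.all(np.asarray(arg) > 0)) for arg in args)


a = np.array([1.0, -2.0, 3.0])   # one invalid entry
b = 2.0

elementwise = argcheck_elementwise(a, b)   # one flag per entry of a
collapsed = argcheck_all(a, b)             # one flag for the whole call
```

The two are not interchangeable: the elementwise result lets a caller mask out only the invalid entries, while the collapsed result rejects the whole call as soon as a single entry is invalid.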
>
> 4: distributions.py is very big, too big for me actually. I recall
> that my first attempt at finding out how the stats stuff worked was to
> see how expon was implemented. No clue that this resided in
> distributions.py.
>
> What I would like to see, although that would require a considerable
> amount of work, is an architecture like this.
> 1) rv_generic.py containing generic stuff
> 2) rv_continuous.py and rv_discrete.py, each importing rv_generic.
> 3) each distribution is covered in a separate file, like expon.py,
> norm.py, etc., and imports rv_continuous.py or rv_discrete.py,
> whichever is appropriate.

I think splitting into continuous and discrete is helpful.

But I don't like splitting off the distributions: 90 files for
distributions with 10 to 20 lines of real code each sounds like a lot
of files when we need to look for anything.

Actually, I find the large file easy to use, using a search string,
and it makes it easy to compare across distributions. Finding the
generic parts can be difficult.
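
As an illustration of how little code a single distribution needs (a sketch only, not taken from distributions.py), a toy exponential distribution can be written by subclassing scipy.stats.rv_continuous. The names toy_expon_gen and toy_expon are made up for this example.

```python
import numpy as np
from scipy.stats import rv_continuous


class toy_expon_gen(rv_continuous):
    # Toy exponential distribution on [0, inf); hypothetical example,
    # not SciPy's own expon implementation.

    def _pdf(self, x):
        return np.exp(-x)

    def _cdf(self, x):
        return 1.0 - np.exp(-x)

    def _ppf(self, q):
        return -np.log1p(-q)


# a=0.0 sets the lower end of the support; loc and scale handling
# comes from the generic base class.
toy_expon = toy_expon_gen(a=0.0, name='toy_expon')

p = toy_expon.cdf(1.0)                     # standard form
p_scaled = toy_expon.cdf(2.0, scale=2.0)   # same probability after rescaling
x_back = toy_expon.ppf(p)                  # inverts the cdf
```

Everything beyond the three underscored methods (argument checking, loc/scale shifting, broadcasting) is inherited, which is why the generic parts dominate the file.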

Josef

> Each docstring can/should contain some generic
> part (like now) and a specific part, with working examples, and clear
> explanations. The most important are normal, expon, binom, geom,
> poisson, and perhaps some others. This would also enable others to
> help extend the documentation, examples....
> 4) I would like to move the math parts in continuous.rst to the doc
> string in the related distribution file.  Since mathjax gives such
> nice results on screen, there is also no reason not to include the
> mathematical facts in the doc string of the distribution itself. In
> fact, most (all?) distributions already have a short math description,
> but this is in overlap with continuous.rst.

The main distinction for scipy usually is that docstrings should be
readable in the interpreter as informative strings without being heavy
on LaTeX, while the tutorial and so on are mainly targeted at HTML.

Josef

>
> I wouldn't mind chopping up distributions.py into the separate
> distributions, and merge it with the maths of continuous.rst. I can
> tackle approx one distribution per day roughly, hence reduce this
> mind-numbing work to roughly 15 minutes a day (correction work on
> exams is much worse :-) ). But I don't know how much this proposal
> will affect the automatic generation of documentation. For the rest I
> don't think this will affect the code a lot.
>
>
>
> Nicky
>
>
>
>
>
> On 15 September 2012 11:59, Ralf Gommers <ralf.gommers at gmail.com> wrote:
>>
>>
>> On Fri, Sep 14, 2012 at 10:56 PM, Jake Vanderplas
>> <vanderplas at astro.washington.edu> wrote:
>>>
>>> On 09/14/2012 01:49 PM, Ralf Gommers wrote:
>>>
>>>
>>>
>>> On Fri, Sep 14, 2012 at 12:48 AM, <josef.pktd at gmail.com> wrote:
>>>>
>>>> On Thu, Sep 13, 2012 at 5:21 PM, nicky van foreest <vanforeest at gmail.com>
>>>> wrote:
>>>> > Hi,
>>>> >
>>>> > Now that I understand github (Thanks to Ralf for his explanations in
>>>> > Dutch) and got some simple stuff out of the way in distributions.py I
>>>> > would like to tackle a somewhat harder issue. The function argsreduce
>>>> > is, as far as I can see, too generic. I did some tests to see whether
>>>> > its most generic output, as described by its docstring, is actually
>>>> > swallowed by the callers of argsreduce, but this appears not to be the
>>>> > case.
>>>>
>>>> being generic is not a disadvantage (per se) if it's fast
>>>>
>>>> https://github.com/scipy/scipy/commit/4abdc10487d453b56f761598e8e013816b01a665
>>>> (and being a one-liner is not a disadvantage either)
>>>>
>>>> Josef
>>>>
>>>> >
>>>> > My motivation to simplify the code in distributions.py (and clean it
>>>> > up) is partly to make it simpler to understand for myself, but
>>>> > also for others. Since GitHub makes code browsing a much nicer
>>>> > experience, perhaps more people will take a look at what's under the
>>>> > hood. But then the code should also be accessible and clean. Are there
>>>> > any reasons not to pursue this path, and to focus instead on more
>>>> > important problems of the stats library?
>>>
>>>
>>> Not sure that argsreduce is the best place to start (see Josef's reply),
>>> but there should be things that can be done to make the code easier to read.
>>> For example, this code is used in ~10 methods of rv_continuous:
>>>
>>>         loc,scale=map(kwds.get,['loc','scale'])
>>>         args, loc, scale = self._fix_loc_scale(args, loc, scale)
>>>         x,loc,scale = map(asarray,(x,loc,scale))
>>>         args = tuple(map(asarray,args))
>>>
>>> Some refactoring may be in order. The same is true of the rest of the
>>> implementation of many of those methods. Some are exactly the same except
>>> for calls to the corresponding underscored method (example: logsf() and
>>> logcdf() are identical except for calls to _logsf() and _logcdf(), and one
>>> nonsensical multiplication).
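
One possible shape for such a refactoring, sketched under assumptions: the four repeated lines move into a single helper that each public method calls. The helper name prepare_args is hypothetical, and fill_defaults is a simplified stand-in for self._fix_loc_scale that only fills in missing values.

```python
import numpy as np


def fill_defaults(args, loc, scale):
    # Hypothetical stand-in for self._fix_loc_scale: it only fills in
    # defaults and does not reproduce the real method's behaviour.
    if loc is None:
        loc = 0.0
    if scale is None:
        scale = 1.0
    return args, loc, scale


def prepare_args(x, args, kwds, fix_loc_scale=fill_defaults):
    # The four-line preamble quoted above, collected in one place.
    loc, scale = map(kwds.get, ['loc', 'scale'])
    args, loc, scale = fix_loc_scale(args, loc, scale)
    x, loc, scale = map(np.asarray, (x, loc, scale))
    args = tuple(map(np.asarray, args))
    return x, args, loc, scale


# A public method would then start with a single call:
x, args, loc, scale = prepare_args([0.0, 1.0], (2,), {'scale': 3.0})
```

With the preamble in one place, methods such as logsf() and logcdf() would differ only in which underscored method they call.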
>>>
>>> Ralf
>>>
>>>
>>>
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>>
>>> I would say that the most important improvement needed in distributions is
>>> in the documentation.
>>>
>>> A new user would look at the doc string of, say, scipy.stats.norm, and
>>> have no idea how to proceed.  Here's the current example from the docstring
>>> of scipy.stats.norm:
>>>
>>> Examples
>>> --------
>>> >>> from scipy.stats import norm
>>> >>> numargs = norm.numargs
>>> >>> [  ] = [0.9,] * numargs
>>> >>> rv = norm()
>>>
>>> >>> x = np.linspace(0, np.minimum(rv.dist.b, 3))
>>> >>> h = plt.plot(x, rv.pdf(x))
>>>
>>> I don't even know what that means... and it doesn't compile.  Also, what
>>> is b?  How would I enter mu and sigma to make a normal distribution?  It's
>>> all pretty opaque.
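
For comparison, a hand-written example might look like the sketch below. It assumes the usual scipy.stats convention that loc is the mean and scale is the standard deviation of the normal distribution; the values of mu and sigma are arbitrary, and plotting is left out so the snippet runs without matplotlib.

```python
import numpy as np
from scipy.stats import norm

mu, sigma = 2.0, 0.5

# "Freeze" the distribution: loc plays the role of mu, scale of sigma.
rv = norm(loc=mu, scale=sigma)

# Evaluate the density on a grid covering three standard deviations
# on either side of the mean.
x = np.linspace(mu - 3 * sigma, mu + 3 * sigma, 7)
pdf_values = rv.pdf(x)

# Half of the probability mass lies below the mean.
cdf_at_mean = rv.cdf(mu)
```

The b in the template example is the upper end of the distribution's support, which is infinite for norm; that is why the template needs np.minimum to obtain a finite plotting range.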
>>
>>
>> True, the examples are confusing. The reason is that they're generated from
>> a template, and it's pretty much impossible to get clear and concise
>> examples that way. It would be better to write custom examples for the
>> most-used distributions, and refer to those from the others.
>>
>> Ralf