[SciPy-Dev] distributions.py

josef.pktd at gmail.com josef.pktd at gmail.com
Sun Sep 16 15:37:23 EDT 2012


On Sun, Sep 16, 2012 at 3:33 PM, nicky van foreest <vanforeest at gmail.com> wrote:
> On 16 September 2012 20:34, Warren Weckesser
> <warren.weckesser at enthought.com> wrote:
>>
>>
>> On Sun, Sep 16, 2012 at 1:17 PM, nicky van foreest <vanforeest at gmail.com>
>> wrote:
>>>
>>> > One comment: I like your idea of splitting the generic code to a
>>> > separate file.  But I'd hesitate to create a separate file for each
>>> > distribution: that's a lot of files.  In my opinion, a good compromise
>>> > would be to create one file for continuous distributions, and one for
>>> > discrete.  All of this could be in a new "scipy.stats.distributions"
>>> > submodule, for the sake of code organization.
>>>
>>> I just responded to Josef. This proposal makes the most sense I guess.
>>>
>>> >
>>> > Also, I'd add one more item to your list: make sure all code is PEP8
>>> > compliant.  Sometimes the PEP8 guidelines can seem a bit cumbersome, but
>>> > they do make browsing and understanding code much easier.
>>>
>>> I'll check the pep8 documentation.
>>>
>>> I guess that improving the documentation is most important for the
>>> moment. Once this is done, we can go on with splitting
>>> distributions.py into two or three smaller files.
>>>
>>
>>
>> FWIW, I'm strongly in favor of the following:
>> * Split distributions.py into three pieces (generic, discrete, continuous).
>
> I'll put that in my local to do list.
>
>> * Fix the screwy docstrings of the distributions (see the example Jake
>> showed in a previous email).
>
> As said in the mail to Jake, I think it is best to update the
> docstrings first. That confused me the most when I started using
> scipy.stats the most.

One related old idea (never implemented):
Splitting up the distributions pdf docs in tutorial into separate
pages for individual distributions, make them nicer with code and
graphs and link them from the docstring of the distribution.

This would keep the docstring itself from blowing up, but we could get
the full html reference if we need to.

Josef


>
>> * PEP8.  (Use the pep8 program to check the code.  I just got 1884 "errors"
>> when I ran 'pep8 --repeat distributions.py | wc -l'.)
>
> I test the code I contribute with pep8.py.
>
>
> Nicky
>>
>> Warren
>>
>>
>>> Nicky
>>>
>>> >
>>> > Thanks again for all your work on this - it's a very valuable
>>> > contribution.
>>> >     Jake
>>> >
>>> > On 09/15/2012 02:03 PM, nicky van foreest wrote:
>>> >> Hi,
>>> >>
>>> >> While reading distributions.py I made a kind of private trac list, of
>>> >> stuff that might need refactoring, As a matter of fact, all issues
>>> >> discussed in the mails above are already on my list. To summarize
>>> >> (Please don't take the list below as a complaint, but just factual. I
>>> >> am very happy that all this exists.)
>>> >>
>>> >> 1: the documentation is not clear, too concise, and fragmented;
>>> >> actually a bit messy.
>>> >>
>>> >> 2: there is code overlap in the check work (The lines Ralf mentioned)
>>> >> making it hard to find out the differences (but the differences in the
>>> >> check work are method dependent so I don't quite know how to tackle
>>> >> that in an elegant way),
>>> >>
>>> >> 3: the docs say that _argscheck need to be rewritten in case users
>>> >> build their own distribution. But then the minimal requirement in my
>>> >> opinion is that argscheck is simple to understand, and not overly
>>> >> generic as it is right now. (I also have examples that its output,
>>> >> while in line with its doc string, results in errors.) As far as I can
>>> >> see its core can simply be replaced by np.all(cond) (I did not test
>>> >> this though).
>>> >>
>>> >> 4: distributions.py is very big, too big for me actually. I recall
>>> >> that my first attempt at finding out how the stats stuff worked was to
>>> >> see how expon was implemented. No clue that this resided in
>>> >> distributions.py.
>>> >>
>>> >> What I would like to see, although that would require a considerable
>>> >> amount of work, is an architecture like this.
>>> >> 1 rv_generic.py containing generic stuff
>>> >> 2) rv_continous.py and rv_discrete.py, each imports rv_generic.
>>> >> 3) each distribution is covered in a separate file. like expon.py,
>>> >> norm, py, etc, and imports rv_continuous.py or rv_discrete.py,
>>> >> whatever appropriate. Each docstring can/should contain some generic
>>> >> part (like now) and a specific part, with working examples, and clear
>>> >> explanations. The most important are normal, expon, binom, geom,
>>> >> poisson, and perhaps some others. This would also enable others to
>>> >> help extend the documentation, examples....
>>> >> 4) I would like to move the math parts in continuous.rst to the doc
>>> >> string in the related distribution file.  Since mathjax gives such
>>> >> nice results on screen, there is also no reason not to include the
>>> >> mathematical facts in the doc string of the distribution itself. In
>>> >> fact, most (all?) distributions already have a short math description,
>>> >> but this is in overlap with continuous.rst.
>>> >>
>>> >> I wouldn't mind chopping up distributions.py into the separate
>>> >> distributions, and merge it with the maths of continuous.rst. I can
>>> >> tackle approx one distribution per day roughly, hence reduce this
>>> >> mind-numbing work to roughly 15 minutes a day (correction work on
>>> >> exams is much worse :-) ). But I don't know how much this proposal
>>> >> will affect the automatic generation of documentation. For the rest I
>>> >> don't think this will affect the code a lot.
>>> >>
>>> >>
>>> >>
>>> >> NIcky
>>> >>
>>> >>
>>> >>
>>> >>
>>> >>
>>> >> On 15 September 2012 11:59, Ralf Gommers <ralf.gommers at gmail.com>
>>> >> wrote:
>>> >>>
>>> >>> On Fri, Sep 14, 2012 at 10:56 PM, Jake Vanderplas
>>> >>> <vanderplas at astro.washington.edu> wrote:
>>> >>>> On 09/14/2012 01:49 PM, Ralf Gommers wrote:
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> On Fri, Sep 14, 2012 at 12:48 AM, <josef.pktd at gmail.com> wrote:
>>> >>>>> On Thu, Sep 13, 2012 at 5:21 PM, nicky van foreest
>>> >>>>> <vanforeest at gmail.com>
>>> >>>>> wrote:
>>> >>>>>> Hi,
>>> >>>>>>
>>> >>>>>> Now that I understand github (Thanks to Ralf for his explanations
>>> >>>>>> in
>>> >>>>>> Dutch) and got some simple stuff out of the way in distributions.py
>>> >>>>>> I
>>> >>>>>> would like to tackle a somewhat harder issue. The function
>>> >>>>>> argsreduce
>>> >>>>>> is, as far as I can see, too generic. I did some tests to see
>>> >>>>>> whether
>>> >>>>>> its most generic output, as described by its docstring, is actually
>>> >>>>>> swallowed by the callers of argsreduce, but this appears not to be
>>> >>>>>> the
>>> >>>>>> case.
>>> >>>>> being generic is not a disadvantage (per se) if it's fast
>>> >>>>>
>>> >>>>>
>>> >>>>> https://github.com/scipy/scipy/commit/4abdc10487d453b56f761598e8e013816b01a665
>>> >>>>> (and a being a one liner is not a disadvantage either)
>>> >>>>>
>>> >>>>> Josef
>>> >>>>>
>>> >>>>>> My motivation to simplify the code in distributions.py (and clean
>>> >>>>>> it
>>> >>>>>> up) is partly based on making it simpler to understand for myself,
>>> >>>>>> but
>>> >>>>>> also to  others. The fact that github makes code browsing a much
>>> >>>>>> nicer
>>> >>>>>> experience, perhaps more people will take a look at what's under
>>> >>>>>> the
>>> >>>>>> hood. But then the code should also be accessible and clean. Are
>>> >>>>>> there
>>> >>>>>> any reasons not to pursue this path, and focus on more important
>>> >>>>>> problems of the stats library?
>>> >>>>
>>> >>>> Not sure that argsreduce is the best place to start (see Josef's
>>> >>>> reply),
>>> >>>> but there should be things that can be done to make the code easier
>>> >>>> to read.
>>> >>>> For example, this code is used in ~10 methods of rv_continuous:
>>> >>>>
>>> >>>>          loc,scale=map(kwds.get,['loc','scale'])
>>> >>>>          args, loc, scale = self._fix_loc_scale(args, loc, scale)
>>> >>>>          x,loc,scale = map(asarray,(x,loc,scale))
>>> >>>>          args = tuple(map(asarray,args))
>>> >>>>
>>> >>>> Some refactoring may be in order. The same is true of the rest of the
>>> >>>> implementation of many of those methods. Some are exactly the same
>>> >>>> except
>>> >>>> for calls to the corresponding underscored method (example: logsf()
>>> >>>> and
>>> >>>> logcdf() are identical except for calls to _logsf() and _logcdf(),
>>> >>>> and one
>>> >>>> nonsensical multiplication).
>>> >>>>
>>> >>>> Ralf
>>> >>>>
>>> >>>>
>>> >>>>
>>> >>>> _______________________________________________
>>> >>>> SciPy-Dev mailing list
>>> >>>> SciPy-Dev at scipy.org
>>> >>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>> >>>>
>>> >>>> I would say that the most important improvement needed in
>>> >>>> distributions is
>>> >>>> in the documentation.
>>> >>>>
>>> >>>> A new user would look at the doc string of, say, scipy.stats.norm,
>>> >>>> and
>>> >>>> have no idea how to proceed.  Here's the current example from the
>>> >>>> docstring
>>> >>>> of scipy.stats.norm:
>>> >>>>
>>> >>>> Examples
>>> >>>> --------
>>> >>>>>>> from scipy.stats import norm
>>> >>>>>>> numargs = norm.numargs
>>> >>>>>>> [  ] = [0.9,] * numargs
>>> >>>>>>> rv = norm()
>>> >>>>>>> x = np.linspace(0, np.minimum(rv.dist.b, 3))
>>> >>>>>>> h = plt.plot(x, rv.pdf(x))
>>> >>>> I don't even know what that means... and it doesn't compile.  Also,
>>> >>>> what
>>> >>>> is b?  how would I enter mu and sigma to make a normal distribution?
>>> >>>> It's
>>> >>>> all pretty opaque.
>>> >>>
>>> >>> True, the examples are confusing. The reason is that they're generated
>>> >>> from
>>> >>> a template, and it's pretty much impossible to get clear and concise
>>> >>> examples that way. It would be better to write custom examples for the
>>> >>> most-used distributions, and refer to those from the others.
>>> >>>
>>> >>> Ralf
>>> >>>
>>> >>>
>>> >>>
>>> >>> _______________________________________________
>>> >>> SciPy-Dev mailing list
>>> >>> SciPy-Dev at scipy.org
>>> >>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>> >>>
>>> >> _______________________________________________
>>> >> SciPy-Dev mailing list
>>> >> SciPy-Dev at scipy.org
>>> >> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>> >
>>> >
>>> > _______________________________________________
>>> > SciPy-Dev mailing list
>>> > SciPy-Dev at scipy.org
>>> > http://mail.scipy.org/mailman/listinfo/scipy-dev
>>> _______________________________________________
>>> SciPy-Dev mailing list
>>> SciPy-Dev at scipy.org
>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
>>
>>
>> _______________________________________________
>> SciPy-Dev mailing list
>> SciPy-Dev at scipy.org
>> http://mail.scipy.org/mailman/listinfo/scipy-dev
>>
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev



More information about the SciPy-Dev mailing list