[SciPy-Dev] distributions.py

Sun Sep 16 14:34:04 EDT 2012

On Sun, Sep 16, 2012 at 1:17 PM, nicky van foreest <vanforeest at gmail.com>wrote:

> > One comment: I like your idea of splitting the generic code to a
> > separate file.  But I'd hesitate to create a separate file for each
> > distribution: that's a lot of files.  In my opinion, a good compromise
> > would be to create one file for continuous distributions, and one for
> > discrete.  All of this could be in a new "scipy.stats.distributions"
> > submodule, for the sake of code organization.
>
> I just responded to Josef. This proposal makes the most sense I guess.
>
> >
> > Also, I'd add one more item to your list: make sure all code is PEP8
> > compliant.  Sometimes the PEP8 guidelines can seem a bit cumbersome, but
> > they do make browsing and understanding code much easier.
>
> I'll check the pep8 documentation.
>
> I guess that improving the documentation is most important for the
> moment. Once this is done, we can go on with splitting
> distributions.py into two or three smaller files.
>
>

FWIW, I'm strongly in favor of the following:
* Split distributions.py into three pieces (generic, discrete, continuous).
* Fix the screwy docstrings of the distributions (see the example Jake
showed in a previous email).
* PEP8.  (Use the pep8 program to check the code.  I just got 1884 "errors"
when I ran 'pep8 --repeat distributions.py | wc -l'.)

Warren

 Nicky
>
> >
> > Thanks again for all your work on this - it's a very valuable
> contribution.
> >     Jake
> >
> > On 09/15/2012 02:03 PM, nicky van foreest wrote:
> >> Hi,
> >>
> >> While reading distributions.py I made a kind of private trac list, of
> >> stuff that might need refactoring, As a matter of fact, all issues
> >> discussed in the mails above are already on my list. To summarize
> >> (Please don't take the list below as a complaint, but just factual. I
> >> am very happy that all this exists.)
> >>
> >> 1: the documentation is not clear, too concise, and fragmented;
> >> actually a bit messy.
> >>
> >> 2: there is code overlap in the check work (The lines Ralf mentioned)
> >> making it hard to find out the differences (but the differences in the
> >> check work are method dependent so I don't quite know how to tackle
> >> that in an elegant way),
> >>
> >> 3: the docs say that _argscheck need to be rewritten in case users
> >> build their own distribution. But then the minimal requirement in my
> >> opinion is that argscheck is simple to understand, and not overly
> >> generic as it is right now. (I also have examples that its output,
> >> while in line with its doc string, results in errors.) As far as I can
> >> see its core can simply be replaced by np.all(cond) (I did not test
> >> this though).
> >>
> >> 4: distributions.py is very big, too big for me actually. I recall
> >> that my first attempt at finding out how the stats stuff worked was to
> >> see how expon was implemented. No clue that this resided in
> >> distributions.py.
> >>
> >> What I would like to see, although that would require a considerable
> >> amount of work, is an architecture like this.
> >> 1 rv_generic.py containing generic stuff
> >> 2) rv_continous.py and rv_discrete.py, each imports rv_generic.
> >> 3) each distribution is covered in a separate file. like expon.py,
> >> norm, py, etc, and imports rv_continuous.py or rv_discrete.py,
> >> whatever appropriate. Each docstring can/should contain some generic
> >> part (like now) and a specific part, with working examples, and clear
> >> explanations. The most important are normal, expon, binom, geom,
> >> poisson, and perhaps some others. This would also enable others to
> >> help extend the documentation, examples....
> >> 4) I would like to move the math parts in continuous.rst to the doc
> >> string in the related distribution file.  Since mathjax gives such
> >> nice results on screen, there is also no reason not to include the
> >> mathematical facts in the doc string of the distribution itself. In
> >> fact, most (all?) distributions already have a short math description,
> >> but this is in overlap with continuous.rst.
> >>
> >> I wouldn't mind chopping up distributions.py into the separate
> >> distributions, and merge it with the maths of continuous.rst. I can
> >> tackle approx one distribution per day roughly, hence reduce this
> >> mind-numbing work to roughly 15 minutes a day (correction work on
> >> exams is much worse :-) ). But I don't know how much this proposal
> >> will affect the automatic generation of documentation. For the rest I
> >> don't think this will affect the code a lot.
> >>
> >>
> >>
> >> NIcky
> >>
> >>
> >>
> >>
> >>
> >> On 15 September 2012 11:59, Ralf Gommers <ralf.gommers at gmail.com>
> wrote:
> >>>
> >>> On Fri, Sep 14, 2012 at 10:56 PM, Jake Vanderplas
> >>> <vanderplas at astro.washington.edu> wrote:
> >>>> On 09/14/2012 01:49 PM, Ralf Gommers wrote:
> >>>>
> >>>>
> >>>>
> >>>> On Fri, Sep 14, 2012 at 12:48 AM, <josef.pktd at gmail.com> wrote:
> >>>>> On Thu, Sep 13, 2012 at 5:21 PM, nicky van foreest <
> vanforeest at gmail.com>
> >>>>> wrote:
> >>>>>> Hi,
> >>>>>>
> >>>>>> Now that I understand github (Thanks to Ralf for his explanations in
> >>>>>> Dutch) and got some simple stuff out of the way in distributions.py
> I
> >>>>>> would like to tackle a somewhat harder issue. The function
> argsreduce
> >>>>>> is, as far as I can see, too generic. I did some tests to see
> whether
> >>>>>> its most generic output, as described by its docstring, is actually
> >>>>>> swallowed by the callers of argsreduce, but this appears not to be
> the
> >>>>>> case.
> >>>>> being generic is not a disadvantage (per se) if it's fast
> >>>>>
> >>>>>
> https://github.com/scipy/scipy/commit/4abdc10487d453b56f761598e8e013816b01a665
> >>>>> (and a being a one liner is not a disadvantage either)
> >>>>>
> >>>>> Josef
> >>>>>
> >>>>>> My motivation to simplify the code in distributions.py (and clean it
> >>>>>> up) is partly based on making it simpler to understand for myself,
> but
> >>>>>> also to  others. The fact that github makes code browsing a much
> nicer
> >>>>>> experience, perhaps more people will take a look at what's under the
> >>>>>> hood. But then the code should also be accessible and clean. Are
> there
> >>>>>> any reasons not to pursue this path, and focus on more important
> >>>>>> problems of the stats library?
> >>>>
> >>>> Not sure that argsreduce is the best place to start (see Josef's
> reply),
> >>>> but there should be things that can be done to make the code easier
> to read.
> >>>> For example, this code is used in ~10 methods of rv_continuous:
> >>>>
> >>>>          loc,scale=map(kwds.get,['loc','scale'])
> >>>>          args, loc, scale = self._fix_loc_scale(args, loc, scale)
> >>>>          x,loc,scale = map(asarray,(x,loc,scale))
> >>>>          args = tuple(map(asarray,args))
> >>>>
> >>>> Some refactoring may be in order. The same is true of the rest of the
> >>>> implementation of many of those methods. Some are exactly the same
> except
> >>>> for calls to the corresponding underscored method (example: logsf()
> and
> >>>> logcdf() are identical except for calls to _logsf() and _logcdf(),
> and one
> >>>> nonsensical multiplication).
> >>>>
> >>>> Ralf
> >>>>
> >>>>
> >>>>
> >>>> _______________________________________________
> >>>> SciPy-Dev mailing list
> >>>> SciPy-Dev at scipy.org
> >>>> http://mail.scipy.org/mailman/listinfo/scipy-dev
> >>>>
> >>>> I would say that the most important improvement needed in
> distributions is
> >>>> in the documentation.
> >>>>
> >>>> A new user would look at the doc string of, say, scipy.stats.norm, and
> >>>> have no idea how to proceed.  Here's the current example from the
> docstring
> >>>> of scipy.stats.norm:
> >>>>
> >>>> Examples
> >>>> --------
> >>>>>>> from scipy.stats import norm
> >>>>>>> numargs = norm.numargs
> >>>>>>> [  ] = [0.9,] * numargs
> >>>>>>> rv = norm()
> >>>>>>> x = np.linspace(0, np.minimum(rv.dist.b, 3))
> >>>>>>> h = plt.plot(x, rv.pdf(x))
> >>>> I don't even know what that means... and it doesn't compile.  Also,
> what
> >>>> is b?  how would I enter mu and sigma to make a normal distribution?
>  It's
> >>>> all pretty opaque.
> >>>
> >>> True, the examples are confusing. The reason is that they're generated
> from
> >>> a template, and it's pretty much impossible to get clear and concise
> >>> examples that way. It would be better to write custom examples for the
> >>> most-used distributions, and refer to those from the others.
> >>>
> >>> Ralf
> >>>
> >>>
> >>>
> >>> _______________________________________________
> >>> SciPy-Dev mailing list
> >>> SciPy-Dev at scipy.org
> >>> http://mail.scipy.org/mailman/listinfo/scipy-dev
> >>>
> >> _______________________________________________
> >> SciPy-Dev mailing list
> >> SciPy-Dev at scipy.org
> >> http://mail.scipy.org/mailman/listinfo/scipy-dev
> >
> >
> > _______________________________________________
> > SciPy-Dev mailing list
> > SciPy-Dev at scipy.org
> > http://mail.scipy.org/mailman/listinfo/scipy-dev
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.python.org/pipermail/scipy-dev/attachments/20120916/488ef4d7/attachment.html>