[SciPy-User] SciPy-User Digest, Vol 96, Issue 55

Alacast alacast at gmail.com
Wed Aug 31 06:30:04 EDT 2011


Hilbert transform:
Padding with zeros to the next power of 2 sped it up greatly. Thanks! Is
there any reason hilbert doesn't do that automatically, then remove the
padding before returning the analytic signal?
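A sketch of that workaround using hilbert's N argument (which zero-pads internally when N exceeds the signal length) and trimming afterwards; pad_hilbert is an ad-hoc helper name, not part of scipy.signal, and the trimmed result is only approximate near the signal edges:

```python
import numpy as np
from scipy.signal import hilbert

def pad_hilbert(x):
    """Analytic signal computed on a power-of-2 FFT length, then trimmed.

    pad_hilbert is a hypothetical helper; hilbert() itself does the
    zero-padding when N > len(x).
    """
    n = len(x)
    nfft = 1 << (n - 1).bit_length()   # next power of 2 >= n
    return hilbert(x, N=nfft)[:n]

t = np.linspace(0.0, 1.0, 6007)        # 6007 is prime: a slow FFT length
envelope = np.abs(pad_hilbert(np.sin(2 * np.pi * 50 * t)))
```

For a pure sinusoid the envelope should sit near 1 away from the edges, which makes a quick sanity check for the trimming step.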

On Mon, Aug 29, 2011 at 10:02 PM, <scipy-user-request at scipy.org> wrote:

> Send SciPy-User mailing list submissions to
>        scipy-user at scipy.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
>        http://mail.scipy.org/mailman/listinfo/scipy-user
> or, via email, send a message with subject or body 'help' to
>        scipy-user-request at scipy.org
>
> You can reach the person managing the list at
>        scipy-user-owner at scipy.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of SciPy-User digest..."
>
>
> Today's Topics:
>
>   1. Re: R vs Python for simple interactive data analysis
>      (josef.pktd at gmail.com)
>   2. Hilbert transform (Alacast)
>   3. Re: Hilbert transform (Robert Kern)
>   4. Re: Return variable value by function value (Kliment)
>   5. Re: R vs Python for simple interactive data analysis
>      (Christopher Jordan-Squire)
>   6. Re: R vs Python for simple interactive data analysis
>      (Christopher Jordan-Squire)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 29 Aug 2011 13:13:48 -0400
> From: josef.pktd at gmail.com
> Subject: Re: [SciPy-User] R vs Python for simple interactive data
>        analysis
> To: SciPy Users List <scipy-user at scipy.org>
> Message-ID:
>        <CAMMTP+D2iRfMH+be8yJF54s3B7BV2uQGz1EkW-8deSecuMaUqA at mail.gmail.com
> >
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Mon, Aug 29, 2011 at 12:59 PM,  <josef.pktd at gmail.com> wrote:
> > On Mon, Aug 29, 2011 at 11:42 AM, <josef.pktd at gmail.com> wrote:
> >> On Mon, Aug 29, 2011 at 11:34 AM, Christopher Jordan-Squire
> >> <cjordan1 at uw.edu> wrote:
> >>> On Mon, Aug 29, 2011 at 10:27 AM, <josef.pktd at gmail.com> wrote:
> >>>> On Mon, Aug 29, 2011 at 11:10 AM, Skipper Seabold <
> jsseabold at gmail.com> wrote:
> >>>>> On Mon, Aug 29, 2011 at 10:57 AM, Christopher Jordan-Squire
> >>>>> <cjordan1 at uw.edu> wrote:
> >>>>>> On Sun, Aug 28, 2011 at 2:54 PM, Skipper Seabold <
> jsseabold at gmail.com> wrote:
> >>>>>>> On Sat, Aug 27, 2011 at 10:15 PM, Bruce Southey <
> bsouthey at gmail.com> wrote:
> >>>>>>>> On Sat, Aug 27, 2011 at 5:06 PM, Wes McKinney <
> wesmckinn at gmail.com> wrote:
> >>>>>>>>> On Sat, Aug 27, 2011 at 5:03 PM, Jason Grout
> >>>>>>>>> <jason-sage at creativetrax.com> wrote:
> >>>>>>>>>> On 8/27/11 1:19 PM, Christopher Jordan-Squire wrote:
> >>>>>>>>>>> This comparison might be useful to some people, so I stuck it
> up on a
> >>>>>>>>>>> github repo. My overall impression is that R is much stronger
> for
> >>>>>>>>>>> interactive data analysis. Click on the link for more details
> why,
> >>>>>>>>>>> which are summarized in the README file.
> >>>>>>>>>>
> >>>>>>>>>> From the README:
> >>>>>>>>>>
> >>>>>>>>>> "In fact, using Python without the IPython qtconsole is
> practically
> >>>>>>>>>> impossible for this sort of cut and paste, interactive analysis.
> >>>>>>>>>> The shell IPython doesn't allow it because it automatically adds
> >>>>>>>>>> whitespace on multiline bits of code, breaking pre-formatted
> code's
> >>>>>>>>>> alignment. Cutting and pasting works for the standard python
> shell,
> >>>>>>>>>> but then you lose all the advantages of IPython."
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> You might use %cpaste in the ipython normal shell to paste
> without it
> >>>>>>>>>> automatically inserting spaces:
> >>>>>>>>>>
> >>>>>>>>>> In [5]: %cpaste
> >>>>>>>>>> Pasting code; enter '--' alone on the line to stop.
> >>>>>>>>>> :if 1>0:
> >>>>>>>>>> :    print 'hi'
> >>>>>>>>>> :--
> >>>>>>>>>> hi
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>>
> >>>>>>>>>> Jason
> >>>>>>>>>>
> >>>>>>>>>> _______________________________________________
> >>>>>>>>>> SciPy-User mailing list
> >>>>>>>>>> SciPy-User at scipy.org
> >>>>>>>>>> http://mail.scipy.org/mailman/listinfo/scipy-user
> >>>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> This strikes me as a textbook example of why we need an
> integrated
> >>>>>>>>> formula framework in statsmodels. I'll make a pass through when I
> get
> >>>>>>>>> a chance and see if there are some places where pandas would
> really
> >>>>>>>>> help out.
> >>>>>>>>
> >>>>>>>> We used to have a formula class in scipy.stats and I do not follow
> >>>>>>>> nipy (http://nipy.sourceforge.net/nipy/stable/index.html) as it
> also
> >>>>>>>> had this (extremely flexible but very hard to comprehend). It was
> what
> >>>>>>>> I had argued was needed ages ago for statsmodels. But it needs a
> >>>>>>>> community effort because the syntax required serves multiple
> >>>>>>>> communities with different annotations and needs. That is also
> seen
> >>>>>>>> from the different approaches taken by the stats packages from
> S/R,
> SAS, Genstat (and those are just the ones I have used).
> >>>>>>>>
> >>>>>>>
> >>>>>>> We have held this discussion at _great_ length multiple times on
> the
> >>>>>>> statsmodels list and are in the process of trying to integrate
> >>>>>>> Charlton (from Nathaniel) and/or Formula (from Jonathan / NiPy)
> into
> >>>>>>> the statsmodels base.
> >>>>>>>
> >>>>>>>
> http://statsmodels.sourceforge.net/dev/roadmap_todo.html#formula-framework
> >>>>>>>
> >>>>>>> and more recently
> >>>>>>>
> >>>>>>>
> https://groups.google.com/group/pystatsmodels/browse_thread/thread/a76ea5de9e96964b/fd85b80ae46c4931
> >>>>>>>
> >>>>>>> https://github.com/statsmodels/formula
> >>>>>>> https://github.com/statsmodels/charlton
> >>>>>>>
> >>>>>>> Wes and I made some effort to go through this at SciPy. From where
> I
> >>>>>>> sit, I think it's difficult to disentangle the data structures from
> >>>>>>> the formula implementation, or maybe I'd just prefer to finish
> >>>>>>> tackling the former because it's much more straightforward. So I'd
> >>>>>>> like to first finish the pandas-integration branch that we've
> started
> >>>>>>> and then focus on the formula support. This is on my (our, I
> hope...)
> >>>>>>> immediate long-term goal list. Then I'd like to come back to the
> >>>>>>> community and hash out the 'rules of the game' details for formulas
> >>>>>>> after we have some code for people to play with, which promises to
> be
> >>>>>>> "fun."
> >>>>>>>
> >>>>>>> https://github.com/statsmodels/statsmodels/tree/pandas-integration
> >>>>>>>
> >>>>>>> FWIW, I could also improve the categorical function to be much
> nicer
> >>>>>>> for the given examples (ie., take a list, drop a reference
> category),
> >>>>>>> but I don't know that it's worth it, because it's really just a
> >>>>>>> stop-gap and ideally users shouldn't have to rely on it. Thoughts
> on
> >>>>>>> more stop-gap?
> >>>>>>>
> >>>>>>
> >>>>>> I want more usability, but I agree that a stop-gap probably isn't
> the
> >>>>>> right way to go, unless it has things we'd eventually want anyways.
> >>>>>>
> >>>>>>> If I understand Chris' concerns, I think pandas + formula will go a
> >>>>>>> long way towards bridging the gap between Python and R usability,
> but
> >>>>>>
> >>>>>> Yes, I agree. pandas + formulas would go a long, long way towards
> more
> >>>>>> usability.
> >>>>>>
> >>>>>> Though I really, really want a scatterplot smoother (i.e., lowess)
> in
> >>>>>> statsmodels. I use it a lot, and the final part of my R file was
> >>>>>> entirely lowess. (And, I should add, that was the part people liked
> >>>>>> best since one of the main goals of the assignment was to generate
> >>>>>> nifty pictures that could be used to summarize the data.)
> >>>>>>
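The scatterplot smoother being asked for can be sketched in a few lines of NumPy. This is a bare-bones local linear fit with tricube weights and no robustness iterations, purely illustrative (O(n^2), and not the statsmodels implementation):

```python
import numpy as np

def lowess_sketch(x, y, frac=0.5):
    """Local linear smoother with tricube weights (no robust reweighting)."""
    n = len(x)
    k = max(2, int(frac * n))               # neighbourhood size
    order = np.argsort(x)
    x, y = x[order], y[order]
    fitted = np.empty(n)
    for i in range(n):
        d = np.abs(x - x[i])
        idx = np.argsort(d)[:k]             # k nearest neighbours
        h = d[idx].max()
        w = (1.0 - (d[idx] / (h if h > 0 else 1.0)) ** 3) ** 3  # tricube
        sw = np.sqrt(w)
        # weighted least squares on [1, x] via row scaling by sqrt(w)
        A = np.column_stack([np.ones(k), x[idx]]) * sw[:, None]
        beta, *_ = np.linalg.lstsq(A, y[idx] * sw, rcond=None)
        fitted[i] = beta[0] + beta[1] * x[i]
    return x, fitted
```

On exactly linear data a local linear fit reproduces the line, which is a convenient correctness check.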
> >>>>>
> >>>>> Working my way through the pull requests. Very time poor...
> >>>>>
> >>>>>>> it's a large effort and there are only a handful (at best) of
> people
> >>>>>>> writing code -- Wes being the only one who's more or less "full
> time"
> >>>>>>> as far as I can tell. The 0.4 statsmodels release should be very
> >>>>>>> exciting though, I hope. I'm looking forward to it, at least. Then
> >>>>>>> there's only the small problem of building an infrastructure and
> >>>>>>> community like CRAN so we can have specialists writing and
> maintaining
> >>>>>>> code...but I hope once all the tools are in place this will seem
> much
> >>>>>>> less daunting. There certainly seems to be the right sentiment for
> it.
> >>>>>>>
> >>>>>>
> >>>>>> At the very least creating and testing models would be much simpler.
> >>>>>> For weeks I've been wanting to see if gmm is the same as gee by
> >>>>>> fitting both models to the same dataset, but I've been putting it
> off
> >>>>>> because I didn't want to construct the design matrices by hand for
> >>>>>> such a simple question. (GMM--Generalized Method of Moments--is a
> >>>>>> standard econometrics model and GEE--Generalized Estimating
> >>>>>> Equations--is a standard biostatistics model. They're both
> >>>>>> generalizations of quasi-likelihood and appear very similar, but I
> >>>>>> want to fit some models to figure out if they're exactly the same.)
> >>>>
> >>>> Since GMM is still in the sandbox, the interface is not very polished,
> >>>> and it's missing some enhancements. I recommend asking on the mailing
> >>>> list if it's not clear.
> >>>>
> >>>> Note GMM itself is very general and will never be a quick interactive
> >>>> method. The main work will always be to define the moment conditions
> >>>> (a bit similar to non-linear function estimation, optimize.leastsq).
> >>>>
> >>>> There are and will be special subclasses, eg. IV2SLS, that have
> >>>> predefined moment conditions, but, still, it's up to the user to
> >>>> construct design and instrument arrays.
> >>>> And as far as I remember, the GMM/GEE package in R doesn't have a
> >>>> formula interface either.
> >>>>
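To make "define the moment conditions" concrete, here is a NumPy-only sketch of just-identified instrumental-variables estimation, where the single moment condition E[z(y - b*x)] = 0 can be solved exactly. This deliberately avoids the (still sandboxed) statsmodels GMM interface; the simulated data and coefficient are invented for illustration:

```python
import numpy as np

rng = np.random.RandomState(0)
n = 5000
z = rng.normal(size=n)                      # instrument
u = rng.normal(size=n)                      # error term, correlated with x
x = 0.8 * z + 0.5 * u + rng.normal(size=n)  # endogenous regressor
y = 2.0 * x + u                             # true coefficient: 2.0

# Moment condition E[z * (y - b*x)] = 0, solved exactly (just-identified):
b_iv = z.dot(y) / z.dot(x)
# OLS for comparison; inconsistent here because x is correlated with u:
b_ols = x.dot(y) / x.dot(x)
```

The point of the comparison is that OLS is biased upward by the x-u correlation, while the moment-condition estimator recovers the true coefficient.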
> >>>
> >>> Both of the two gee packages in R I know of have formula interfaces.
> >>>
> >>> http://cran.r-project.org/web/packages/geepack/
> >>> http://cran.r-project.org/web/packages/gee/index.html
> >
> > This is very different from what's in GMM in statsmodels so far. The
> > help file is very short, so I'm mostly guessing.
> > It seems to be for (a subset of) generalized linear models with
> > longitudinal/panel covariance structures. Something like this will
> > eventually be available (once we get panel data models) as a special
> > case of GMM in statsmodels, assuming it's similar to what I know from
> > the econometrics literature.
> >
> > Most of the subclasses of GMM that I currently have are focused on
> > instrumental variable estimation, including non-linear regression.
> > This should be expanded over time.
> >
> > But GMM itself is designed for subclassing by someone who wants to use
> > her/his own moment conditions, as in
> > http://cran.r-project.org/web/packages/gmm/index.html
> > or for us to implement specific models with it.
> >
> > If someone wants to use it, then I have to quickly add the options for
> > the kernels of the weighting matrix, which I keep postponing.
> > Currently there is only a truncated, uniform kernel that assumes
> > observations are ordered by time, but users can provide their own
> > weighting function.
> >
> > Josef
> >
> >>
> >> I have to look at this. I mixed up some acronyms, I meant GEL and GMM
> >> http://cran.r-project.org/web/packages/gmm/index.html
> >> the vignette was one of my readings, and the STATA description for GMM.
> >>
> >> I never really looked at GEE. (That's Skipper's private work so far.)
> >>
> >> Josef
> >>
> >>>
> >>> -Chris JS
> >>>
> >>>> Josef
> >>>>
> >>>>>>
> >>>>>
> >>>>> Oh, it's not *that* bad. I agree, of course, that it could be better,
> >>>>> but I've been using mainly Python for my work, including GMM and
> >>>>> estimating equations models (mainly empirical likelihood and
> >>>>> generalized maximum entropy) for the last ~two years.
> >>>>>
> >>>>> Skipper
> >>
> >
>
> just to make another point:
>
> Without someone adding mixed effects, hierarchical, panel/longitudinal
> models, and so on, it will not help to have a formula interface to them.
> (Thanks to Scott we will soon have survival)
>
> Josef
>
>
> ------------------------------
>
> Message: 2
> Date: Mon, 29 Aug 2011 18:38:09 +0100
> From: Alacast <alacast at gmail.com>
> Subject: [SciPy-User] Hilbert transform
> To: scipy-user at scipy.org
> Message-ID:
>        <CAGoRfgERZW8WqrUY3=UkkeQKz_ND5dspmyzahX4As=H9QvgU0A at mail.gmail.com
> >
> Content-Type: text/plain; charset="iso-8859-1"
>
> I'm doing some analyses on sets of real-valued time series in which I want
> to know the envelope/instantaneous amplitude of each series in the set.
> Consequently, I've been taking the Hilbert transform (using
> scipy.signal.hilbert), then taking the absolute value of the result.
>
> The problem is that sometimes this process is far too slow. These time
> series can have on the order of 10^5 to 10^6 data points, and the sets can
> have up to 128 time series. Some datasets have been taking an hour or hours
> to compute on a perfectly modern computing node (1TB of RAM, plenty of
> 2.27GHz cores, etc.). Is this expected behavior?
>
> I learned that Scipy's Hilbert transform implementation uses FFT, and that
> Scipy's FFT implementation can run in O(n^2) time when the number of time
> points is prime. This happened in a few of my datasets, but I've now
> included a check and correction for that (drop the last data point, so now
> the number is even and consequently not prime). Still, I observe a good
> amount of variability in run times, and they are rather long. Thoughts?
>
> Thanks!
> -------------- next part --------------
> An HTML attachment was scrubbed...
> URL:
> http://mail.scipy.org/pipermail/scipy-user/attachments/20110829/7c59ef28/attachment-0001.html
>
> ------------------------------
>
> Message: 3
> Date: Mon, 29 Aug 2011 13:06:02 -0500
> From: Robert Kern <robert.kern at gmail.com>
> Subject: Re: [SciPy-User] Hilbert transform
> To: SciPy Users List <scipy-user at scipy.org>
> Message-ID:
>        <CAF6FJiswfUeJ_4f6xYNkd1d5bXoM5hYfWsfCSPyX3chzucfGkA at mail.gmail.com
> >
> Content-Type: text/plain; charset=UTF-8
>
> On Mon, Aug 29, 2011 at 12:38, Alacast <alacast at gmail.com> wrote:
> > I'm doing some analyses on sets of real-valued time series in which I
> want
> > to know the envelope/instantaneous amplitude of each series in the set.
> > Consequently, I've been taking the Hilbert transform (using
> > scipy.signal.hilbert), then taking the absolute value of the result.
> > The problem is that sometimes this process is far too slow. These time
> > series can have on the order of 10^5 to 10^6 data points, and the sets
> can
> > have up to 128 time series. Some datasets have been taking an hour or
> hours
> > to compute on a perfectly modern computing node (1TB of RAM, plenty of
> > 2.27Ghz cores, etc.). Is this expected behavior?
> > I learned that Scipy's Hilbert transform implementation uses FFT, and
> that
> > Scipy's FFT implementation can run in O(n^2) time when the number of time
> > points is prime. This happened in a few of my datasets, but I've now
> > included a check and correction for that (drop the last data point, so
> now
> > the number is even and consequently not prime). Still, I observe a good
> > amount of variability in run times, and they are rather long. Thoughts?
>
> Having N be prime is just the extreme case. Basically, the FFT
> recursively computes the DFT. It can only recurse on integral factors
> of N, so any prime factor M must be computed the slow way, taking
> O(M^2) steps. You probably have large prime factors sitting around. A
> typical approach is to pad your signal with 0s until the next power of
> 2 or other reasonably-factorable size.
>
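The suggested padding in code (next_pow2 is an ad-hoc helper; any highly composite length works as well, and note that zero-padding changes the DFT, so downstream code has to account for it, e.g. by trimming after an analytic-signal computation):

```python
import numpy as np

def next_pow2(n):
    """Smallest power of two that is >= n."""
    return 1 << (int(n) - 1).bit_length()

x = np.random.randn(6007)          # 6007 is prime: a worst-case FFT length
nfft = next_pow2(len(x))           # 8192
padded = np.concatenate([x, np.zeros(nfft - len(x))])
X = np.fft.fft(padded)             # O(nfft log nfft) rather than O(n^2)
```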
> --
> Robert Kern
>
> "I have come to believe that the whole world is an enigma, a harmless
> enigma that is made terrible by our own mad attempt to interpret it as
> though it had an underlying truth."
>    -- Umberto Eco
>
>
> ------------------------------
>
> Message: 4
> Date: Sun, 28 Aug 2011 16:04:25 +0200
> From: "Kliment" <otrov at hush.ai>
> Subject: Re: [SciPy-User] Return variable value by function value
> To: scipy-user at scipy.org
> Message-ID: <20110828140425.E64D9E6719 at smtp.hushmail.com>
> Content-Type: text/plain; charset="UTF-8"
>
> Thanks for your input guys
>
> So in similar cases I should use interpolation function (or solver
> depending on initial function) from SciPy package
>
> Example I provided was from scratch of course, but it seems that
> 0.95 is still in y range:
>
> >>> sqrt(1 - 98**2/10E+4)
> 0.95076811052958654
>
> >>> sqrt(1 - 99**2/10E+4)
> 0.94973154101567037
>
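For completeness, the solver route for this example: the two evaluations above bracket f(x) = 0.95 between x = 98 and x = 99, so a bracketing root finder such as scipy.optimize.brentq applies directly:

```python
from math import sqrt
from scipy.optimize import brentq

def f(x):
    return sqrt(1 - x**2 / 10E+4)

# f(98) > 0.95 > f(99), so the root of f(x) - 0.95 lies in [98, 99]
x0 = brentq(lambda x: f(x) - 0.95, 98.0, 99.0)
```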
>
> Regards,
> Kliment
>
>
>
> ------------------------------
>
> Message: 5
> Date: Mon, 29 Aug 2011 15:55:08 -0500
> From: Christopher Jordan-Squire <cjordan1 at uw.edu>
> Subject: Re: [SciPy-User] R vs Python for simple interactive data
>        analysis
> To: SciPy Users List <scipy-user at scipy.org>
> Message-ID:
>        <CAEJxiFp+ev4ORv=2i1h9L2-zaVqT2xO8dp__8NLTdqWMQt_FMQ at mail.gmail.com
> >
> Content-Type: text/plain; charset=ISO-8859-1
>
> I've just pushed an updated version of the .r and .py files to github,
> as well as a summary of the corrections/suggestions from the mailing
> list. I'd appreciate any further comments/suggestions.
>
> Compared to the original .r and .py files, in these revised versions:
> -The R code was cleaned up because I realized I didn't need to use
>    as.factor if I made the relevant variables into factors
> -The python code was cleaned up by computing the 'sub-design matrices'
>    associated with each factor variable before hand and stashing
>    them in a dictionary
> -Names were added to the variables in the regression by creating them
>    from the calls to sm.categorical and stashing them in a dictionary
>
> Notably, the helper functions and stashing of the pieces of design matrices
> simplified the calls for model fitting, but they didn't noticeably shorten
> the code. They also required a small increase in complexity. (In terms of
> the
> data structures and function calls used to create the list of names and
> the design matrices.)
>
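The sub-design-matrix stashing described above can be sketched without any statsmodels helpers; dummies is a hypothetical function and the factor names and data are invented for illustration:

```python
import numpy as np

def dummies(values, drop_first=True):
    """Dummy-code a sequence of labels; optionally drop a reference level.

    (An illustrative helper, not sm.categorical itself.)
    """
    arr = np.asarray(values)
    levels = sorted(set(values))
    if drop_first:
        levels = levels[1:]                 # first level = reference category
    mat = np.column_stack([(arr == lev).astype(float) for lev in levels])
    return levels, mat

# Stash each factor's column names and sub-design matrix once,
# then reuse them across model fits:
factors = {'sex': ['m', 'f', 'f', 'm'], 'group': ['a', 'b', 'a', 'a']}
design = {name: dummies(vals) for name, vals in factors.items()}
```

Keeping the names alongside the matrices is what lets the regression summaries carry readable variable names later.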
> I also added some comments to the effect that:
> *one can use paste or cpaste in the IPython shell
> *np.set_printoptions or sm.iolib.SimpleTable can be used to help with
> printing of numpy arrays
> *names can be added by the user to regression model summaries
> *one can make helper functions to construct design matrices and keep
> track of names, but the simplest way of doing it isn't robust to
> subsetting the data in the presence of categorical variables
>
> Did I miss anything?
>
> -Chris JS
>
>
> On Sat, Aug 27, 2011 at 1:19 PM, Christopher Jordan-Squire
> <cjordan1 at uw.edu> wrote:
> > Hi--I've been a moderately heavy R user for the past two years, so
> > about a month ago I took an (abbreviated) version of a simple data
> > analysis I did in R and tried to rewrite as much of it as possible,
> > line by line, into python using numpy and statsmodels. I didn't use
> > pandas, and I can't comment on how much it might have simplified
> > things.
> >
> > This comparison might be useful to some people, so I stuck it up on a
> > github repo. My overall impression is that R is much stronger for
> > interactive data analysis. Click on the link for more details why,
> > which are summarized in the README file.
> >
> > https://github.com/chrisjordansquire/r_vs_py
> >
> > The code examples should run out of the box with no downloads (other
> > than R, Python, numpy, scipy, and statsmodels) required.
> >
> > -Chris Jordan-Squire
> >
>
>
> ------------------------------
>
> Message: 6
> Date: Mon, 29 Aug 2011 16:03:00 -0500
> From: Christopher Jordan-Squire <cjordan1 at uw.edu>
> Subject: Re: [SciPy-User] R vs Python for simple interactive data
>        analysis
> To: SciPy Users List <scipy-user at scipy.org>
> Message-ID:
>        <CAEJxiFr60ekfHDw-aw3q5Ur4oFu0=ET04XRR+vb_O90Cf_rLdg at mail.gmail.com
> >
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Mon, Aug 29, 2011 at 12:13 PM,  <josef.pktd at gmail.com> wrote:
> > [...]
> >
> > just to make another point:
> >
> > Without someone adding mixed effects, hierachical, panel/longitudinal
> > models, and .... it will not help to have a formula interface to them.
> > (Thanks to Scott we will soon have survival)
> >
>
> I don't think I understand.
>
> I assumed that the formula framework is essentially orthogonal to the
> models themselves. In the sense that it should be simple to adapt a
> formula framework to new models. At least if they're some variety of
> linear model, and provided the formula framework is designed to allow
> for grouping syntax from the beginning. I think ease of extension to
> new models is a major goal, in fact, since we want it to be easy for
> people to contribute new models.
>
> -Chris JS
>
>
> > Josef
> > _______________________________________________
> > SciPy-User mailing list
> > SciPy-User at scipy.org
> > http://mail.scipy.org/mailman/listinfo/scipy-user
> >
>
>
> ------------------------------
>
> _______________________________________________
> SciPy-User mailing list
> SciPy-User at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-user
>
>
> End of SciPy-User Digest, Vol 96, Issue 55
> ******************************************
>