[SciPy-Dev] Proposal for Scikit-Signal - a SciPy toolbox for signal processing

Travis Oliphant travis at continuum.io
Wed Jan 4 01:37:23 EST 2012


Hi Gael, 

Thanks for your email.  I appreciate the detailed response.  Please don't misinterpret my distaste for the scikits namespace as anything more than organizational.  I'm very impressed with most of the scikits themselves: scikit-learn being a particular favorite.  It is very clear to most people that smaller teams and projects are useful for diverse collaboration and are an effective way to involve more people in development.  This is all very good, and I'm very encouraged by this development.  Even in the SciPy package itself, active development happens on only a few packages, which have received attention from small teams.

Of course, the end user wants integration, so the more packages exist, the more we need tools like EPD, ActivePython, Python(x,y), and Sage (and corresponding repositories like CRAN).  The landscape is much better in this direction than it was earlier, but packaging and distribution are still a major weak point in Python.  I think the scientific computing community should just continue to develop its own packaging solutions.  I've been a big fan of David Cournapeau's work in this area (bento being his latest effort).

Your vision of a bazaar model is a good one.  I just think we need to get scipy itself more into that model.  I agree it's useful to have a core set of common functionality, but I am quite in favor of moving to a tighter-knit core for the main scipy package, with additional scipy-*named* packages (e.g. scipy-odr) around it.  These could install directly into the scipy package infrastructure (or use whatever import mechanisms the distributions desire); one possible mechanism is sketched below.  This move to more modular packages for SciPy itself has been in my mind for a long time, which is certainly why I see the scikits namespace as superfluous.  But I understand that branding means something.
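
To make that mechanism concrete, here is a minimal sketch (my own illustration, not an agreed design) of how a separately distributed scipy-odr could slot into the scipy namespace using pkgutil-style namespace packages, assuming the core scipy __init__.py extended its __path__ the same way:

    # Hypothetical source layout for a separately distributed scipy-odr:
    #
    #   scipy-odr/
    #       setup.py
    #       scipy/
    #           __init__.py      <- namespace stub only, no real code
    #           odr/
    #               __init__.py  <- the actual subpackage
    #
    # scipy/__init__.py -- the stub merges every 'scipy' directory found
    # on sys.path into one logical package, so 'import scipy.odr' works
    # regardless of which distribution shipped the subpackage:
    __path__ = __import__('pkgutil').extend_path(__path__, __name__)

Distributions that dislike namespace packages could instead copy the subpackage directly into the core scipy directory; the point is only that workable mechanisms exist.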

So, my (off the top of my head) take on what should be core scipy is: 

fftpack
stats
io
special
optimize
linalg
lib.blas
lib.lapack
misc

I think the other packages should be maintained, built, and distributed as:

scipy-constants
scipy-integrate
scipy-cluster
scipy-ndimage
scipy-spatial
scipy-odr
scipy-sparse
scipy-maxentropy
scipy-signal
scipy-weave  (actually I think weave should be installed separately and/or merged with other foreign code integration tools like fwrap, f2py, etc.)

Then, we could create a scipy superpack to install it all together.     What issues do people see with a plan like this? 
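
For concreteness, the superpack could be little more than a meta-package whose only job is to depend on everything; a minimal sketch, using the hypothetical package names from the list above:

    # setup.py for a hypothetical 'scipy-superpack' meta-package that
    # pulls in the slimmed-down core plus every split-out scipy-* package.
    from setuptools import setup

    setup(
        name='scipy-superpack',
        version='0.1',
        description='Meta-package: core scipy plus all scipy-* add-ons',
        install_requires=[
            'scipy',             # the slimmed-down core
            'scipy-constants',
            'scipy-integrate',
            'scipy-cluster',
            'scipy-ndimage',
            'scipy-spatial',
            'scipy-odr',
            'scipy-sparse',
            'scipy-maxentropy',
            'scipy-signal',
            'scipy-weave',
        ],
    )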

Obviously it takes time and effort to do this.   But, I'm hoping to find time or sponsor people who will have time to do this work.   Thus, I'd like to have the conversation to find out what people think *should* be done.   There also may be students looking for a way to get involved or people interested in working on Google Summer of Code projects. 

Thanks, 

-Travis





On Jan 3, 2012, at 9:44 AM, Gael Varoquaux wrote:

> Hi Travis,
> 
> It is good that you are asking these questions. I think that they are
> important. Let me try to give my view on some of the points you raise.
> 
>> There are too many scikits already that should just be scipy projects
> 
> I used to think pretty much as you do: I don't want to have to depend on
> too many packages. In addition, we are a community, so why so many
> packages? My initial vision when investing in scikit-learn was that
> we would merge it back into scipy after a while. The dynamics of the project
> have changed my way of seeing things a bit, and I now think that it is a
> good thing to have scikits-like packages that are more specialized than
> scipy, for the following reasons:
> 
> 1. Development is technically easier in smaller packages
> 
>    A developer working on a specific package does not need to tackle the
>    complexity of the full scipy suite. Building can be made easier, as scipy
>    must (for good reasons) depend on Fortran and C++ packages. It is well known
>    that the complexity of developing a project grows super-linearly with the
>    number of lines of code.
> 
>    It is also much easier to achieve short release cycles. Short
>    release cycles are critical to the dynamics of a community-driven
>    project (and I'd like to thank our current release manager, Ralf
>    Gommers, for his excellent work).
> 
> 2. Narrowing the application domain helps developers and users
> 
>    It is much easier to make entry points, in the code and in the
>    documentation, with a given application in mind. Also, best practices and
>    conventions may vary between communities. While this is (IMHO) one of the
>    tragedies of contemporary science, such domain specialization
>    helps people feel comfortable.
> 
>    Computational trade-offs tend to be fairly specific to a given
>    context. For instance, machine learning is more often interested in
>    datasets with a large number of features and a (comparatively) small
>    number of samples, whereas in statistics it is the opposite. Thus the
>    same algorithm might be implemented differently (see the toy sketch
>    after this list). Catering to all needs tends to make the code much
>    more complex, and may confuse the user by presenting him with too
>    many options.
> 
>    Developers cannot be experts in everything. If I specialize in machine
>    learning and follow the recent developments in the literature, chances are
>    that I do not have time to stay competitive in numerical integration. Having
>    too wide a scope in a project means that each developer understands well only
>    a small fraction of the code. That makes things really hard for the release
>    manager, but also for day-to-day work, e.g. deciding what to do with a newly
>    broken test.
> 
> 3. It is easier to build an application-specific community
> 
>    An application-specific library is easier to brand. One can tailor a
>    website, a user manual, and conference presentations or papers to an
>    application. As a result, the project gains visibility in the community
>    of scientists and engineers it targets.
> 
>    Also, having more focused mailing lists helps build enthusiasm, as they
>    have less volume and are more focused on questions that people
>    are interested in.
> 
>    Finally, a sad but true observation is that people tend to get more credit
>    when working on an application-specific project than on a core layer.
>    Similarly, it is easier for me to secure funding for the development of an
>    application-specific project.
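> 
> To illustrate the trade-off point above with a toy sketch (code I am
> making up for this email, not something from scipy or scikit-learn):
> ridge regression can be solved in a primal form whose linear system
> scales with the number of features, or in a mathematically equivalent
> dual form that scales with the number of samples. Which branch a
> project optimizes, documents, and tests first depends on the community
> it serves:
> 
>     import numpy as np
> 
>     def ridge(X, y, alpha):
>         # Minimize ||X w - y||^2 + alpha * ||w||^2, choosing the
>         # formulation whose linear system is smaller.
>         n_samples, n_features = X.shape
>         if n_features > n_samples:
>             # Dual form: an (n_samples x n_samples) system -- the
>             # typical machine-learning regime (many features).
>             K = np.dot(X, X.T) + alpha * np.eye(n_samples)
>             return np.dot(X.T, np.linalg.solve(K, y))
>         # Primal form: an (n_features x n_features) system -- the
>         # typical statistics regime (many samples).
>         A = np.dot(X.T, X) + alpha * np.eye(n_features)
>         return np.linalg.solve(A, np.dot(X.T, y))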
> 
> On a positive note, I would like to stress that I think
> scikit-learn has had a generally positive impact on the scipy ecosystem,
> including for those who do not use it, or who do not care at all about
> machine learning. First, it is drawing more users into the community, and
> as a result, there is more interest and money flying around. But more
> importantly, when I look at the latest release of scipy, I see that many of
> the new contributors are also scikit-learn contributors (not only
> Fabian). This can be partly explained by the fact that getting involved
> in scikit-learn was an easy and high-return-on-investment move for
> them, but they quickly grew to realize that the base layer could be
> improved. We have always had the vision of pushing into scipy any improvement
> that is general enough to be useful across application domains.
> Remember, David Cournapeau was lured into the scipy business by working on
> the original scikit-learn. 
> 
>> Frankly, it makes me want to pull out all of the individual packages I
>> wrote that originally got pulled together into SciPy into separate
>> projects and develop them individually from there. 
> 
> What you are proposing is interesting; that said, I think that the
> current status quo with scipy is a good one. Having a core collection of
> numerical tools is, IMHO, a key element of the Python scientific
> community, for two reasons:
> 
>  * For the user, knowing that he will find the answer to most of his
>    simple questions in a single library makes it easy to start. It also
>    makes it easier to document.
> 
>  * Different packages need to rely on a lot of common generic tools:
>    linear algebra, sparse linear algebra, simple statistics and signal
>    processing, simple black-box optimizers, interpolation, ND-image-like
>    processing. Indeed, you ask which packages in scipy people use.
>    Actually, in scikit-learn we use all sub-packages apart from
>    'integrate'. I checked, and we even use 'io' in one of the examples.
>    Any code doing high-end application-specific numerical computing will
>    need at least a few of the packages of scipy (see the snippet after
>    this list). Of course, a package may need an optimizer tailored to a
>    specific application, in which case they will roll their own, and this
>    effort might be duplicated a bit. But having the common core helps
>    consolidate the ecosystem.
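> 
> As an aside, to make the list of generic tools above concrete, here is
> a made-up snippet touching three scipy sub-packages in a few lines (the
> computations are deliberately trivial; only the imports matter):
> 
>     import numpy as np
>     from scipy import linalg, optimize, sparse
>     import scipy.sparse.linalg as splinalg
> 
>     # dense linear algebra: LU-factor a small random matrix
>     lu, piv = linalg.lu_factor(np.random.rand(5, 5))
> 
>     # sparse linear algebra: solve a trivial diagonal system
>     S = 2.0 * sparse.identity(100, format='csc')
>     x = splinalg.spsolve(S, np.ones(100))
> 
>     # black-box optimization: minimize a simple quadratic
>     xmin = optimize.fmin(lambda v: (v[0] - 1.0) ** 2 + (v[1] + 2.0) ** 2,
>                          [0.0, 0.0])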
> 
> So the setup that I am advocating is a core library with many other
> satellite packages. Or rather, a constellation of packages that use each
> other, rather than a monolithic universe. This is a common strategy:
> breaking a package up into parts that can be used independently to make
> them lighter and hopefully ease the development of the whole. For
> instance, this is what was done with the ETS (Enthought Tool Suite). And we
> have all seen this strategy go bad, for instance in the situation of
> 'dependency hell', in which all packages start depending on each
> other, installation becomes an issue, and there is a gridlock of
> version-compatibility bugs. This is why any such ecosystem must have an
> almost tree-like structure in its dependency graph. Some packages must be
> at the top of the graph, more 'core' than others, and as we descend the
> graph, packages can reduce their dependencies. I think that we have more
> or less this situation with scipy, and I am quite happy about it.
> 
> Now, I hear your frustration when this development happens a bit in the
> wild, with no visible construction of an ecosystem. This ecosystem does
> get constructed via the scipy mailing lists, conferences, and the
> community in general, but it may not be very clear to the external
> observer. One reason why my group decided to invest in scikit-learn was
> that it was the learning package that seemed the closest in terms of code
> and community connections. This was the virtue of the 'scikits' branding. For
> technical reasons, the different scikits have started getting rid of this
> namespace in the module import. You seem to think that the branding name
> 'scikits' does not accurately reflect the fact that they are tight
> members of the scipy constellation. While I must say that I am not a huge
> fan of the name 'scikits', we have now invested in it, and I don't think
> that we can easily move away. 
> 
> If the problem is a branding issue, it may be partly addressed with
> appropriate communication. A set of links across the different web pages
> of the ecosystem, and a central document explaining the relationships
> between the packages, might help. But this idea is not completely new; it
> is simply waiting for someone to invest time in it. For instance,
> there was the project of reworking the scipy.org homepage.
> 
> Another important problem is the question of what sits 'inside' this
> collection of tools, and what is outside. The answer to this question
> will pretty much depend on who you ask. In practice, for the end user, it
> is very much conditioned by what meta-package they can download. EPD,
> Sage, Python(x,y), and many others give different answers.
> 
> To conclude, I'd like to stress that, in my eyes, what really matters is
> a solution that gives us a vibrant community, with a good production of
> quality code and documentation. I think that the current set of small
> projects makes it easier to gather developers and users, and that it
> works well as long as they talk to each other and do not duplicate
> each other's functionality too much. If, on top of that, they are BSD-licensed
> and use numpy as their data model, I am a happy man.
> 
> What I am pushing for is a bazaar-like development model, in which it is
> easy for various approaches answering different needs to develop in
> parallel with different compromises. In such a context, I think that
> Jaidev could kick-start a successful and useful scikit-signal. Hopefully
> this would not preclude improvements to the docs, examples, and existing
> code in scipy.signal.
> 
> Sorry for the long post, and thank you for reading.
> 
> Gael
> _______________________________________________
> SciPy-Dev mailing list
> SciPy-Dev at scipy.org
> http://mail.scipy.org/mailman/listinfo/scipy-dev



