[SciPy-Dev] Proposal for Scikit-Signal - a SciPy toolbox for signal processing

Gael Varoquaux gael.varoquaux at normalesup.org
Tue Jan 3 10:44:22 EST 2012


Hi Travis,

It is good that you are asking these questions. I think that they are
important. Let me try to give my view on some of the points you raise.

> There are too many scikits already that should just be scipy projects

I used to think pretty much as you did: I don't want to have to depend on
too many packages. In addition we are a community, so why so many
packages? My initial vision when investing in the scikit-learn was that
we would merge it back to scipy after a while. The dynamic of the project
has changed a bit my way of seeing things, and I now think that it is a
good thing to have scikits-like packages that are more specialized than
scipy for the following reasons:

 1. Development is technically easier in smaller packages

    A developer working on a specific package does not need to tackle
    the complexity of the full scipy suite. Building can be made
    easier, whereas scipy must (for good reasons) depend on Fortran and
    C++ packages. It is well known that the complexity of developing a
    project grows super-linearly with its number of lines of code.

    It's also much easier to achieve short release cycles. Short
    release cycles are critical to the dynamic of a community-driven
    project (and I'd like to thank our current release manager, Ralf
    Gommers, for his excellent work).

 2. Narrowing the application domain helps developers and users

    It is much easier to make entry points, in the code and in the
    documentation, with a given application in mind. Also, best
    practices and conventions may vary between communities. While this
    is (IMHO) one of the tragedies of contemporary science, such domain
    specialization helps people feel comfortable.

    Computational trade-offs tend to be fairly specific to a given
    context. For instance, machine learning is more often interested in
    datasets with a large number of features and a (comparatively)
    small number of samples, whereas in statistics it is the opposite.
    Thus the same algorithm might be implemented differently (see the
    sketch after this list). Catering for all needs tends to make the
    code much more complex, and may confuse the user by presenting too
    many options.

    Developers cannot be experts in everything. If I specialize in
    machine learning, and follow the recent developments in the
    literature, chances are that I do not have time to stay competitive
    in numerical integration. Having too wide a scope in a project
    means that each developer understands only a small fraction of the
    code well. That makes things really hard for the release manager,
    but also for day-to-day work, e.g. deciding what to do with a newly
    broken test.

 3. It is easier to build an application-specific community

    An application-specific library is easier to brand. One can tailor
    a website, a user manual, and conference presentations or papers to
    an application. As a result the project gains visibility in the
    community of scientists and engineers it targets.

    Also, having more focused mailing lists helps build enthusiasm, as
    they have less volume and are more focused on questions that people
    are interested in.

    Finally, a sad but true observation is that people tend to get more
    credit when working on an application-specific project than on a
    core layer. Similarly, it is easier for me to obtain funding for
    the development of an application-specific project.
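
To make the trade-off mentioned in point 2 concrete, here is a minimal
NumPy sketch (my own illustration, not code from scikit-learn or
scipy) of ridge regression solved in its primal or dual form depending
on the shape of the data. Both branches return the same weights; each
is cheap in a different regime:

    import numpy as np

    def ridge(X, y, lam=1.0):
        # Minimize ||X w - y||^2 + lam * ||w||^2, picking the cheaper
        # formulation from the shape of X.  Both branches give the
        # same w, via the identity
        #   (X'X + lam*I)^-1 X'  ==  X' (XX' + lam*I)^-1
        n_samples, n_features = X.shape
        if n_samples >= n_features:
            # Statistics-flavoured data (n >> p): solve the (p x p)
            # primal normal equations.
            A = np.dot(X.T, X) + lam * np.eye(n_features)
            return np.linalg.solve(A, np.dot(X.T, y))
        else:
            # Machine-learning-flavoured data (p >> n): solve the
            # (n x n) dual system instead.
            K = np.dot(X, X.T) + lam * np.eye(n_samples)
            return np.dot(X.T, np.linalg.solve(K, y))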

On a positive note, I would like to stress that I think the
scikit-learn has had a generally positive impact on the scipy
ecosystem, including for those who do not use it, or who do not care
at all about machine learning. First, it is drawing more users into
the community, and as a result there is more interest and money flying
around. But more importantly, when I look at the latest release of
scipy, I see that many of the new contributors are also scikit-learn
contributors (not only Fabian). This can be partly explained by the
fact that getting involved in the scikit-learn was an easy and
high-return-on-investment move for them, but they quickly grew to
realize that the base layer could be improved. We have always had the
vision to push into scipy any improvement that is general enough to be
useful across application domains. Remember, David Cournapeau was
lured into the scipy business by working on the original scikit-learn.

> Frankly, it makes me want to pull out all of the individual packages I
> wrote that originally got pulled together into SciPy into separate
> projects and develop them individually from there. 

What you are proposing is interesting; that said, I think that the
current status quo with scipy is a good one. Having a core collection
of numerical tools is, IMHO, a key element of the Python scientific
community, for two reasons:

  * For the user, knowing that he will find the answer to most of his
    simple questions in a single library makes it easy to get started.
    It also makes the ecosystem easier to document.

  * Different packages need to rely on a lot of common generic tools:
    linear algebra, sparse linear algebra, simple statistics and
    signal processing, simple black-box optimizers, interpolation, and
    ND-image-like processing. Indeed, you ask which packages in scipy
    people use. Actually, in scikit-learn we use all sub-packages
    apart from 'integrate'. I checked, and we even use 'io' in one of
    the examples. Any code doing high-end application-specific
    numerical computing will need at least a few of the packages of
    scipy, as the sketch below illustrates. Of course, a package may
    need an optimizer tailored to a specific application, in which
    case its developers will roll their own, and this effort might be
    duplicated a bit. But having the common core helps consolidate the
    ecosystem.
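
As an illustration of how much a downstream package leans on this
common core, here is a toy sketch (my own, not taken from
scikit-learn) that touches several scipy sub-packages in a few lines:

    import numpy as np
    from scipy import (linalg, optimize, interpolate, signal, stats,
                       ndimage)

    rng = np.random.RandomState(0)

    # linalg: a dense least-squares fit
    A = rng.randn(50, 3)
    b = rng.randn(50)
    coef = linalg.lstsq(A, b)[0]

    # optimize: the same fit through a black-box minimizer
    w = optimize.fmin(lambda v: np.sum((np.dot(A, v) - b) ** 2),
                      np.zeros(3), disp=False)

    # interpolate + signal: put an irregularly sampled signal on a
    # regular grid, then low-pass filter it
    t = np.sort(rng.rand(30))
    f = interpolate.interp1d(t, np.sin(2 * np.pi * t))
    grid = np.linspace(t.min(), t.max(), 200)
    num, den = signal.butter(4, 0.1)
    smooth = signal.filtfilt(num, den, f(grid))

    # stats: a simple summary statistic
    r, p = stats.pearsonr(f(grid), smooth)

    # ndimage: image-like processing on a 2-D array
    img = ndimage.gaussian_filter(rng.randn(32, 32), sigma=2.0)

Nothing in this sketch is specific to machine learning; it is exactly
the kind of plumbing that any application-specific package ends up
needing.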

So the setup that I am advocating is a core library with many other
satellite packages; or rather a constellation of packages that use
each other, rather than a monolithic universe. This is a common
strategy of breaking a package up into parts that can be used
independently, to make them lighter and hopefully ease the development
of the whole. For instance, this is what was done to the ETS (Enthought
Tool Suite). And we have all seen this strategy go bad, for instance
in the situation of 'dependency hell', in which all packages start
depending on each other, installation becomes an issue, and there is a
gridlock of version-compatibility bugs. This is why any such ecosystem
must have an almost tree-like structure in its dependency graph: some
packages must sit at the top of the graph, more 'core' than others,
and as we descend the graph, packages can reduce their dependencies. I
think that we have more or less this situation with scipy, and I am
quite happy about it.

Now I hear your frustration when this development happens a bit in the
wild, with no visible construction of an ecosystem. This ecosystem
does get constructed via the scipy mailing-lists, conferences, and in
general the community, but it may not be very clear to the external
observer. One reason why my group decided to invest in the
scikit-learn was that it was the learning package that seemed the
closest in terms of code and community connections. This was the
virtue of the 'scikits' branding. For technical reasons, the different
scikits have started getting rid of this namespace in the module
import. You seem to think that the branding name 'scikits' does not
accurately reflect the fact that they are tight members of the scipy
constellation. While I must say that I am not a huge fan of the name
'scikits', we have now invested in it, and I don't think that we can
easily move away.

If the problem is a branding issue, it may be partly addressed with
appropriate communication. A set of links across the different web
pages of the ecosystem, and a central document explaining the
relationships between the packages, might help. But this idea is not
completely new; it is simply waiting for someone to invest time in it.
For instance, there was the project of reworking the scipy.org
homepage.

Another important problem is the question of what sits 'inside' this
collection of tools, and what is outside. The answer to this question
will pretty much depend on who you ask. In practice, for the end user, it
is very much conditioned by what meta-package they can download. EPD,
Sage, Python(x,y), and many others give different answers.

To conclude, I'd like to stress that, in my eyes, what really matters
is a solution that gives us a vibrant community, with a good
production of quality code and documentation. I think that the current
set of small projects makes it easier to gather developers and users,
and that it works well as long as they talk to each other and do not
duplicate each other's functionality too much. If on top of that they
are BSD-licensed and use numpy as their data model, I am a happy man.

What I am pushing for is a bazaar-like development model, in which it
is easy for various approaches answering different needs to be
developed in parallel, with different compromises. In such a context,
I think that Jaidev could kick-start a successful and useful
scikit-signal. Hopefully this would not preclude improvements to the
docs, examples, and existing code in scipy.signal.

Sorry for the long post, and thank you for reading.

Gael


